When the Cluster service fails a group over, it first takes all resources in the group offline, starting with the most dependent and moving to the least dependent. The disk is frequently the last resource taken offline, because most resources depend on the disk either explicitly or transitively, and the disk typically does not depend on other resources. Because all clustered resources that depend on the disk are stopped before the disk is failed over, the handles that those programs have open to the disk are closed as part of the failover process.
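The offline ordering described above can be pictured as a reverse topological sort of the group's dependency graph. The following Python sketch is only an illustration under stated assumptions: the resource names and the `offline_order` helper are hypothetical, and it does not describe how the Cluster service is actually implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def offline_order(depends_on):
    """Order resources from most dependent to least dependent.

    depends_on maps each resource name to the set of resources
    that it depends on (hypothetical example data, see below).
    """
    # static_order() lists every resource after the resources it depends on,
    # which is the order in which a group could be brought online.
    online = list(TopologicalSorter(depends_on).static_order())
    # Reversing the online order takes dependents offline before providers.
    return list(reversed(online))


# Hypothetical group: a file share that needs a network name and a disk.
group = {
    "File Share": {"Network Name", "Disk Q:"},
    "Network Name": {"IP Address"},
    "IP Address": set(),
    "Disk Q:": set(),
}

order = offline_order(group)
print(order)
```

In this example the file share always goes offline first, and the disk goes offline only after every resource that depends on it.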
If a program uses the shared disks but is not a clustered resource, its handles to the shared storage are not gracefully closed when the Cluster service fails the disk over. Because such programs operate outside the Cluster service, the Cluster service cannot shut them down or close their handles gracefully. During a failover, the handles to the disk are orphaned, and the disk may be marked for a chkdsk command. Likewise, if files that have open handles are unexpectedly closed while the disk is being dismounted, the disk may be flagged for a chkdsk command.
When you bring a disk online, the Cluster service checks whether the disk has been marked as "dirty" and therefore requires a chkdsk command. If the disk is marked as "dirty," the Cluster service initiates a chkdsk command before the disk is brought online. Any resources in the cluster that depend on that disk do not come online until the chkdsk command has run and the Cluster service has successfully brought the disk online.
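To check the dirty bit manually, outside the Cluster service, you can use the Windows `fsutil dirty query` command. The following Python sketch wraps that command. It assumes an elevated prompt and English-language output that contains "is Dirty" or "is NOT Dirty"; `is_dirty` and `query_dirty` are hypothetical helper names, and this is not presented as the check that the Cluster service itself performs.

```python
import subprocess


def is_dirty(fsutil_output):
    """Interpret the text printed by 'fsutil dirty query <volume>'.

    Assumes English output such as 'Volume - Q: is Dirty' or
    'Volume - Q: is NOT Dirty'; localized systems may differ.
    """
    text = fsutil_output.lower()
    if "not dirty" in text:
        return False
    if "dirty" in text:
        return True
    raise ValueError("Unrecognized fsutil output: " + repr(fsutil_output))


def query_dirty(volume):
    """Run fsutil on a Windows node and report the dirty bit for a volume."""
    result = subprocess.run(
        ["fsutil", "dirty", "query", volume],
        capture_output=True,
        text=True,
        check=True,  # raise if fsutil fails, for example without elevation
    )
    return is_dirty(result.stdout)
```

For example, `query_dirty("Q:")` run on the node that currently owns drive Q would return `True` if the volume is flagged as dirty.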
Currently, there is no way to tell exactly why a disk was marked as "dirty." However, you can use the following troubleshooting steps to help determine what set the dirty bit on the disk:
- Turn off any filter drivers that you do not require, such as open file agents and file scanning utilities.
- Make sure that you are running the correct firmware versions and drivers for your host bus adapters on both nodes. To be in a supported environment, both nodes must use the same versions of the firmware, BIOS, and drivers.
- Check the system logs for hardware errors.
- Turn off any programs and services that you do not require and that may try to lock files or maintain handles to a disk.
To determine whether the problem is caused by an open handle to the drive, run the Handle.exe utility immediately after the issue occurs, on the cluster node where the physical disk resource did not come online. (Handle.exe is a Windows Sysinternals tool that is available from http://technet.microsoft.com/en-us/sysinternals/bb896655.aspx.)
At a command prompt, type the following command, and then press ENTER:
Handle.exe -a -u Drive_letter
Note: Drive_letter is a placeholder for the drive designation for the cluster drive that did not come online.
For example, assume that the drive designation for the cluster drive that did not come online is drive Q. To run the Handle.exe utility in this scenario, type the following command, and then press ENTER:
Handle.exe -a -u Q:
You can then research the process that has the open handle to the drive.
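Because the utility should be run immediately after the issue occurs, it can help to script the command so that the output is captured for later research. The following Python sketch is one possible approach. It assumes that Handle.exe is on the PATH of the affected node; `handle_command` and `capture_handles` are hypothetical helper names.

```python
import subprocess


def handle_command(drive_letter, exe="Handle.exe"):
    """Build the 'Handle.exe -a -u <drive>:' command shown above."""
    # Accept 'q', 'Q:' or 'Q:\' and normalize to a single uppercase letter.
    letter = drive_letter.strip().rstrip(":\\").upper()
    if len(letter) != 1 or not (letter.isascii() and letter.isalpha()):
        raise ValueError("Expected a single drive letter, got " + repr(drive_letter))
    return [exe, "-a", "-u", letter + ":"]


def capture_handles(drive_letter):
    """Run Handle.exe on the affected node and return its text output."""
    result = subprocess.run(
        handle_command(drive_letter),
        capture_output=True,
        text=True,
    )
    return result.stdout
```

For example, `capture_handles("Q")` runs the same command as the drive Q example and returns the output as a string that you can save or search for process names.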
For more information about chkdsk on a cluster disk, see the following articles in the Microsoft Knowledge Base:
176970 How to Run the CHKDSK /F Command on a Shared Cluster Disk
272244 Location of the Chkdsk Results for Windows Clustering Resources