Sometimes, a cluster node may stop responding ("hang").
Certain conditions, such as thread deadlocks or memory leaks, may deprive
user-mode processes of resources that they must have to function correctly.
These conditions may also prevent user-mode processes from running. This may
cause the programs or services on the cluster node to stop servicing client
requests. Because cluster node health monitoring is performed at the kernel
level, and because kernel components may continue to function in these cases, a
cluster node whose user-mode processes have stopped responding may still appear
to be a fully functioning cluster node. The unresponsive cluster node becomes
unavailable to the end user, but it does not fail over because the other
cluster nodes cannot detect a failure in user-mode space.
The following symptoms typically indicate that the cluster node has stopped
responding:
- You can confirm IP connectivity to the server that is
hanging by pinging it.
- You cannot successfully establish a connection to the
server by using the net use command.
- You cannot successfully connect to the server by using a
Terminal Services client.
- You can move the mouse pointer when you log on locally to
the server.
- You cannot start programs or utilities when you are logged
on locally to the server.
Note: Although other issues may cause some of these symptoms,
this combination of symptoms generally indicates that the server has stopped
responding.
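As an illustration only, the symptom combination above can be written as a single check. This is a minimal sketch: the Observations type and its field names are hypothetical, and the values are assumed to come from manual checks (ping, net use, a Terminal Services client, and a local logon), not from any Cluster service API.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    """Results of the manual checks (hypothetical field names)."""
    ping_ok: bool                     # IP connectivity confirmed by ping
    net_use_ok: bool                  # "net use" connection succeeded
    terminal_services_ok: bool        # Terminal Services client connected
    mouse_moves_locally: bool         # mouse pointer moves at the console
    can_start_programs_locally: bool  # programs start at the console


def looks_hung(obs: Observations) -> bool:
    """Return True only when every symptom in the list is present."""
    return (
        obs.ping_ok
        and not obs.net_use_ok
        and not obs.terminal_services_ok
        and obs.mouse_moves_locally
        and not obs.can_start_programs_locally
    )
```

A partial match returns False, which mirrors the note above: individual symptoms may have other causes, and only the full combination generally indicates a hang.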
"Hang" detection in Cluster service
The Windows Cluster service incorporates a limited detection
mechanism that may detect unresponsiveness in user-mode components. ClusNet
monitors the health of ClusSvc based on periodic communication between the
user-mode ClusSvc.exe program and the kernel-mode ClusNet driver. This periodic
communication is the heartbeat. The Cluster service in Windows Server 2003 and
Windows 2000 SP4 has two new properties that control the behavior of the
heartbeat:
- ClusSvcHeartbeatTimeout
This property controls how long the ClusNet driver waits
between ClusSvc heartbeats before it determines that ClusSvc has stopped
responding. By default, the value for this property is 60 seconds.
- HangRecoveryAction
This property controls the action to take if the
user-mode processes have stopped responding. By default, the Cluster service
stops. This causes cluster resources to fail over to other cluster
nodes.
How to turn on Cluster service "hang" detection
The Cluster service processes the changes to these cluster
properties only during the initialization of the Cluster service. Therefore,
you must stop and then restart the Cluster service on each node to make sure
that the new policies take effect. To minimize resource downtime, restart the
Cluster service on the cluster nodes one node at a time.
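The one-node-at-a-time restart can be sketched as a simple loop. This is an illustration under stated assumptions: the stop, start, and is_healthy callables are hypothetical placeholders for whatever method you use to stop, start, and verify the Cluster service on a node. They are not part of any Microsoft tool.

```python
from typing import Callable, Iterable, List


def rolling_restart(
    nodes: Iterable[str],
    stop: Callable[[str], None],
    start: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Restart the Cluster service on each node in turn.

    Only one node is down at any time. The loop stops at the first
    node that does not come back, so the remaining nodes keep running
    with their old settings.
    """
    restarted: List[str] = []
    for node in nodes:
        stop(node)
        start(node)
        if not is_healthy(node):
            raise RuntimeError(
                f"{node} did not recover; already restarted: {restarted}"
            )
        restarted.append(node)
    return restarted
```

Stopping at the first failure is a design choice in this sketch: it avoids taking a second node offline while the first is still unavailable.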
ClusSvcHeartbeatTimeout
To configure how much time elapses before ClusNet determines that
ClusSvc is unresponsive, set the value of the
ClusSvcHeartbeatTimeout property. The heartbeat interval is set according to the
following formula:
ClusSvcHeartbeatTimeout in seconds / 4
For example, if you set the
ClusSvcHeartbeatTimeout property to 60 seconds, the heartbeat is sent every 15 seconds
(60 seconds divided by 4).
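The formula can be expressed as a one-line helper. This is a minimal sketch; the function name is hypothetical, and the rejection of non-positive values is an assumption of the sketch, not documented Cluster service behavior.

```python
def heartbeat_interval_seconds(timeout_seconds: int) -> float:
    """Return the heartbeat interval for a ClusSvcHeartbeatTimeout value.

    Implements the formula from the text: timeout in seconds divided by 4.
    """
    if timeout_seconds <= 0:
        # Assumption of this sketch: only positive timeouts make sense.
        raise ValueError("timeout_seconds must be positive")
    return timeout_seconds / 4


# With the default timeout of 60 seconds, the interval is 15 seconds.
default_interval = heartbeat_interval_seconds(60)
```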
The ClusNet driver maintains a countdown timer that initiates the
HangRecoveryAction property when it reaches 0 (zero). Whenever the ClusNet driver
receives a ClusSvc heartbeat, the countdown timer is reset to the value of the
ClusSvcHeartbeatTimeout property. Additionally, when the Cluster service stops for
any reason, the ClusNet driver automatically turns off the countdown timer.
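The countdown behavior described above can be modeled in a few lines. This is a conceptual sketch only: the class and method names are hypothetical, time is advanced manually through tick(), and nothing here reflects the actual ClusNet driver implementation.

```python
from typing import Callable


class HangCountdown:
    """Toy model of the ClusNet countdown timer described in the text."""

    def __init__(self, timeout_seconds: int, on_expire: Callable[[], None]):
        self.timeout = timeout_seconds
        self.remaining = timeout_seconds
        self.enabled = True
        self.on_expire = on_expire  # stands in for HangRecoveryAction

    def heartbeat(self) -> None:
        """A ClusSvc heartbeat resets the countdown to the full timeout."""
        if self.enabled:
            self.remaining = self.timeout

    def service_stopped(self) -> None:
        """When the Cluster service stops, the countdown is turned off."""
        self.enabled = False

    def tick(self, seconds: int = 1) -> None:
        """Advance time; fire the recovery action once at zero."""
        if not self.enabled:
            return
        self.remaining = max(0, self.remaining - seconds)
        if self.remaining == 0:
            self.enabled = False
            self.on_expire()
```

In this model, a heartbeat that arrives at any point before the countdown reaches zero prevents the recovery action, and a stopped service never triggers it.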
To set the value of the ClusSvcHeartbeatTimeout property, run the following
command from a command prompt:
cluster.exe /cluster:clustername /prop clussvcheartbeattimeout=number of seconds
where clustername is the name of the cluster and number of seconds is the number
of seconds that you want to use in the calculation of the heartbeat.
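If you script this change, the command line can be assembled as an argument list. This is a minimal sketch that only builds the arguments shown in the command above; the function name and the cluster name MYCLUSTER are hypothetical, and the check for a positive value is an assumption of the sketch.

```python
from typing import List


def heartbeat_timeout_command(cluster_name: str, seconds: int) -> List[str]:
    """Build the cluster.exe argument list that sets ClusSvcHeartbeatTimeout."""
    if seconds <= 0:
        # Assumption of this sketch: reject values that cannot be a timeout.
        raise ValueError("seconds must be positive")
    return [
        "cluster.exe",
        f"/cluster:{cluster_name}",
        "/prop",
        f"clussvcheartbeattimeout={seconds}",
    ]


# Hypothetical cluster name; pass the list to subprocess.run() on a node
# where cluster.exe is available.
args = heartbeat_timeout_command("MYCLUSTER", 60)
```

Remember that the new value takes effect only after the Cluster service is restarted on each node.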
HangRecoveryAction
When the ClusNet driver countdown timer reaches 0 (zero), the
HangRecoveryAction property is initiated. You can set the
HangRecoveryAction property to one of the following numeric values:
- 0 (zero): Disables the heartbeat and monitoring
mechanism.
- 1: Logs an event in the system log of the Event Viewer.
- 2: Terminates the Cluster service. This is the default
setting.
- 3: Causes a Stop error (Bugcheck) on the cluster node.
To set the value of the HangRecoveryAction property, run the following command
at a command prompt:
cluster.exe /cluster:clustername /prop hangrecoveryaction=n
where clustername is the name of the cluster and n is the number that
corresponds to the action that you want to occur if the ClusNet driver countdown
timer reaches 0 (zero).
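The four documented values map naturally onto an enumeration, which also guards against passing an unsupported number. This is a minimal sketch: the member names, the function name, and the cluster name MYCLUSTER are hypothetical; only the numeric values and the command syntax come from the text above.

```python
from enum import IntEnum
from typing import List


class HangRecoveryAction(IntEnum):
    """Numeric values listed for the HangRecoveryAction property."""
    DISABLED = 0           # disables the heartbeat and monitoring mechanism
    LOG_EVENT = 1          # logs an event in the system log
    TERMINATE_SERVICE = 2  # terminates the Cluster service (default)
    BUGCHECK = 3           # causes a Stop error on the node


def hang_recovery_command(cluster_name: str, action: int) -> List[str]:
    """Build the cluster.exe argument list that sets HangRecoveryAction."""
    validated = HangRecoveryAction(action)  # raises ValueError outside 0..3
    return [
        "cluster.exe",
        f"/cluster:{cluster_name}",
        "/prop",
        f"hangrecoveryaction={validated.value}",
    ]


# Hypothetical cluster name.
args = hang_recovery_command("MYCLUSTER", HangRecoveryAction.BUGCHECK)
```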
Note In some extreme cases, system services may also stop responding,
and actions 1 and 2 in the earlier list may not succeed. In such cases, action
3 (bugcheck) is the only effective recovery measure.
If the action is set to cause a bugcheck on the cluster node, Windows stops
responding and you receive Stop error (Bugcheck) code 0x9E. The Stop error
causes a failover to another cluster node. Additionally, if the node where the
Stop error occurs is configured to capture a memory dump file, you may be able
to use the information that is contained in the memory dump file to diagnose
the cause of the unresponsive cluster node. The following code is an example of
a stack trace from a Kernel dump that the ClusNet driver initiated:
ChildEBP RetAddr
f9c33ea8 f6e2e11f nt!KeBugCheckEx+0x19
f9c33ecc f6e2e836 clusnet!CnpCheckClussvcHang+0xef
f9c33ef0 805070d7 clusnet!CnpHeartBeatDpc+0x47e
f9c33fa4 8050735d nt!KiTimerExpiration+0x371
f9c33ff4 80543ccf nt!KiRetireDpcList+0x63
The Bugcheck information is similar to the following:
BugCheck 9E, {812d5b08, 3c, 0, 0}
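When you collect these lines from several nodes, it can help to parse them into numbers. The following is a minimal sketch that assumes the line has exactly the shape shown above (a hexadecimal code followed by hexadecimal parameters in braces); the function name is hypothetical. In the example line, the second parameter, 3c, is 60 in decimal, which matches the default timeout of 60 seconds. The article does not define the parameters, so treat that as an observation, not a documented meaning.

```python
import re
from typing import List, Tuple


def parse_bugcheck(line: str) -> Tuple[int, List[int]]:
    """Parse a 'BugCheck XX, {a, b, c, d}' line into (code, parameters)."""
    match = re.fullmatch(
        r"\s*BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}\s*", line
    )
    if match is None:
        raise ValueError(f"not a BugCheck line: {line!r}")
    code = int(match.group(1), 16)
    # All parameters are hexadecimal, as in the debugger output.
    params = [int(part.strip(), 16) for part in match.group(2).split(",")]
    return code, params


code, params = parse_bugcheck("BugCheck 9E, {812d5b08, 3c, 0, 0}")
```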
Important You must manually configure the server to generate a memory dump
file in response to a Bugcheck.
Windows 2000 service pack information
To resolve this problem, obtain the latest service pack for Windows 2000. For more information, click the following article number to view the article in the Microsoft Knowledge Base:
260910 How to obtain the latest Windows 2000 service pack
Windows 2000 hotfix information
A supported hotfix is available from Microsoft. However, this hotfix is intended to correct only the problem that is described in this article. Apply this hotfix only to systems that are experiencing this specific problem.
If the hotfix is available for download, there is a "Hotfix download available" section at the top of this Knowledge Base article. If this section does not appear, submit a request to Microsoft Customer Service and Support to obtain the hotfix.
Note If additional issues occur or if any troubleshooting is required, you might have to create a separate service request. The usual support costs will apply to additional support questions and issues that do not qualify for this specific hotfix. For a complete list of Microsoft Customer Service and Support telephone numbers or to create a separate service request, visit the following Microsoft Web site:
Note The "Hotfix download available" form displays the languages for which the hotfix is available. If you do not see your language, it is because a hotfix is not available for that language.
The English version of this hotfix has the file attributes (or later file attributes) that are listed in the following table. The dates and times for these files are listed in Coordinated Universal Time (UTC). When you view the file information, it is converted to local time. To find the difference between UTC and local time, use the Time Zone tab in the Date and Time tool in Control Panel.
Date         Time   Version        Size       File name
----------------------------------------------------------
12-Mar-2003  14:22  5.0.2195.6683     55,568  Clusapi.dll
12-Mar-2003  14:02  5.0.2195.6683     67,760  Clusnet.sys
12-Mar-2003  14:02  5.0.2195.6683    682,768  Clussvc.exe
12-Mar-2003  14:22  5.0.2195.6660     99,600  Netman.dll
12-Mar-2003  14:22  5.0.2195.6604    477,456  Netshell.dll
12-Mar-2003  14:02  5.0.2195.6683     54,544  Resrcmon.exe
07-Mar-2003  18:41  5.0.2195.6680  3,988,992  Sp3res.dll
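Because the table lists UTC times while file properties are displayed in local time, a small conversion can help when you compare the two. This is a minimal sketch with hypothetical function names; it assumes an English locale for the month abbreviation and uses the time zone configured on the computer that runs it.

```python
from datetime import datetime, timezone


def parse_utc(date_str: str, time_str: str) -> datetime:
    """Parse a Date/Time pair from the table as a UTC-aware datetime."""
    naive = datetime.strptime(f"{date_str} {time_str}", "%d-%b-%Y %H:%M")
    return naive.replace(tzinfo=timezone.utc)


def to_local(utc_dt: datetime) -> datetime:
    """Convert a UTC datetime to the local time zone of this computer."""
    return utc_dt.astimezone()


# Clusapi.dll from the table; the printed value depends on the local zone.
stamp = parse_utc("12-Mar-2003", "14:22")
print(to_local(stamp).strftime("%d-%b-%Y %H:%M %Z"))
```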