
Windows Server 2003 Server Cluster with a Generic Script Resource Stops Responding for Long Periods



Symptoms

In a cluster where there is an active Generic Script resource, the cluster may become unresponsive. Cluster Administrator and Cluster.exe appear to stop responding (hang). The cluster log shows blocked threads inside a Generic Script resource. For example:
000007c4.000007e4::2002/12/12-19:17:03.781 INFO [FM] FmpRmOnlineResource: called InterlockedIncrement on gdwQuoBlockingResources for resource f37f58fb-03ff-44b3-a4d7-086b0838d73d
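As a rough aid for locating such entries, the following Python sketch scans log lines for the message shown above and extracts the timestamp and resource ID. The field layout in the pattern is inferred from this single example line, so it is an assumption and may not match every cluster log variant; the names `BLOCKING_LINE` and `find_blocking_resources` are hypothetical.

```python
import re

# Pattern inferred from the single example line above (an assumption):
# <process id>.<thread id>::<timestamp> <level> [FM] <message> <resource id>
BLOCKING_LINE = re.compile(
    r"^(?P<pid>[0-9a-fA-F]+)\.(?P<tid>[0-9a-fA-F]+)::"
    r"(?P<timestamp>\S+)\s+(?P<level>\w+)\s+\[FM\]\s+"
    r"FmpRmOnlineResource: called InterlockedIncrement on "
    r"gdwQuoBlockingResources for resource "
    r"(?P<resource_id>[0-9a-fA-F-]{36})"
)


def find_blocking_resources(lines):
    """Return (timestamp, resource_id) pairs for each matching log line."""
    hits = []
    for line in lines:
        match = BLOCKING_LINE.match(line.strip())
        if match:
            hits.append((match.group("timestamp"), match.group("resource_id")))
    return hits


# The example line from this article, used as sample input.
SAMPLE_LINE = (
    "000007c4.000007e4::2002/12/12-19:17:03.781 INFO [FM] "
    "FmpRmOnlineResource: called InterlockedIncrement on "
    "gdwQuoBlockingResources for resource "
    "f37f58fb-03ff-44b3-a4d7-086b0838d73d"
)

for timestamp, resource_id in find_blocking_resources([SAMPLE_LINE]):
    print(timestamp, resource_id)
```

The extracted resource ID can then be compared against the resources in the cluster to identify which Generic Script resource is involved.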
The event log contains a message similar to either of the following:

Event ID: 1232
Event Type: Error
Event Source: ClusSvc
Cluster generic script resource MyScript timed out. Online script entry point did not complete execution in a timely manner. This could be due to an infinite loop or a hang in this entry point, or the pending timeout may be too short for this resource. Please review the Online script entry point to make sure there's no infinite loop or a hang in the script code, and then consider increasing the pending timeout value if necessary. In a command shell, run "cluster res "MyScript" /prop PersistentState=0" to disable this resource, and then run "net stop clussvc" to stop the cluster service. Ensure that any problem in the script code is fixed. Then run "net start clussvc" to start the cluster service. If necessary, ensure that the pending time out is increased before bringing the resource online again.

or

Event ID: 1233
Event Type: Error
Event Source: ClusSvc
Cluster generic script resource MyScript: Request to perform the Online operation will not be processed. This is because of a previous failed attempt to execute the Online entry point in a timely fashion. Please review the script code for this entry point to make sure there is no infinite loop or a hang in it, and then consider increasing the resource pending timeout value if necessary. In a command shell, run "cluster res "MyScript" /pro PersistentState=0" to disable this resource, and then run "net stop clussvc" to stop the cluster service. Ensure that any problem in the script code is fixed. Then run "net start clussvc" to start the cluster service. If necessary, ensure that the pending time out is increased before bringing the resource online again.



Cause

A Generic Script resource script can cause the whole cluster to stop responding if any of the following conditions exist:
  • The Generic Script resource script contains an infinite loop (and therefore never exits).
  • The script calls certain cluster application programming interfaces (APIs). These calls must be avoided from within a resource DLL or resource script because they can cause a cluster-wide deadlock. The script may call cluster APIs directly, or it may start Cluster.exe as one of its steps (which may in turn call the cluster APIs that must be avoided). For information about APIs that should not be called from a resource DLL or script, see "Function Calls to Avoid in Resource DLLs" in the Microsoft Platform SDK (PSDK).
  • An action the Generic Script resource script is performing takes longer than the pending timeout value.
To avoid hanging indefinitely, the Cluster Resource Monitor refuses to perform any operations (such as Online, Offline, IsAlive, and LooksAlive) on the script after any operation has exceeded the pending timeout value. Any additional attempts to perform Generic Script resource operations on that resource result in the second event log message (Event ID 1233) that is shown in the "Symptoms" section of this article.
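The behavior described above can be pictured with a small Python model. This is only a toy illustration of the rule as this article states it, not the Resource Monitor's actual implementation; the class name `ScriptResourceModel`, its parameters, and the 180-second timeout are hypothetical values chosen for the example.

```python
class ScriptResourceModel:
    """Toy model of the rule described above (not real cluster code)."""

    def __init__(self, name, pending_timeout_s):
        self.name = name
        self.pending_timeout_s = pending_timeout_s
        # Set once any entry point exceeds the pending timeout.
        self.blocked = False

    def call(self, entry_point, elapsed_s):
        """Simulate an entry point call that ran for elapsed_s seconds.

        Returns the event ID that would be logged (1232 or 1233),
        or None when the call completes within the timeout.
        """
        if self.blocked:
            # A previous call timed out, so the request is refused.
            return 1233
        if elapsed_s > self.pending_timeout_s:
            # First timeout: log it and refuse all later operations.
            self.blocked = True
            return 1232
        return None


# Hypothetical resource with an arbitrary 180-second pending timeout.
resource = ScriptResourceModel("MyScript", pending_timeout_s=180)
print(resource.call("Online", elapsed_s=10))    # completes in time
print(resource.call("Online", elapsed_s=200))   # exceeds the timeout
print(resource.call("IsAlive", elapsed_s=1))    # refused after the timeout
```

In this model, nothing clears the blocked state; that mirrors the recovery procedure in this article, which requires stopping and restarting the Cluster service.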



Resolution

The Cluster Resource Monitor will not perform any additional operations on a Generic Script resource after any entry point has exceeded the pending timeout value, but the problematic thread will continue to run. To resolve the problem, disable the resource (that is, prevent it from coming online), stop the Cluster service (this terminates the problematic thread), fix the script problem, and then restart the Cluster service. Depending on the cause of this problem, you may want to increase the online or offline pending timeout value for this resource. For step-by-step instructions, see the "Recover and Restart the Cluster Service" section later in this article.

Changing Pending Timeout Values

Any cluster resource operation should complete well within the pending timeout. For this reason, do not change the timeout value without a thorough understanding of why your script entry point exceeds it. Also, consider all the implications of increasing this value, because a hung entry point can leave the cluster unresponsive until the timeout value is exceeded.
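One way to keep an entry point within the pending timeout is to give every wait inside it an explicit time budget, so that the script fails cleanly instead of looping forever. The sketch below shows that pattern in Python purely as an illustration; real Generic Script entry points are written in a Windows scripting language such as VBScript or JScript, and the helper name `wait_until` and its parameters are hypothetical.

```python
import time


def wait_until(condition, budget_s, poll_interval_s=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Poll condition() until it returns True or the budget runs out.

    Returns True if the condition was met, False if the time budget
    expired first. The loop always terminates, unlike an unbounded
    "while not ready" loop.
    """
    deadline = clock() + budget_s
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


# Usage: choose a budget well below the resource's pending timeout
# (the 0.05-second value here is arbitrary, for demonstration only).
ready = wait_until(lambda: False, budget_s=0.05, poll_interval_s=0.01)
print("ready" if ready else "gave up before the budget was exceeded")
```

An entry point built this way can report failure as soon as the budget expires, rather than relying on the Resource Monitor to detect the timeout.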

Recover and Restart the Cluster Service

  1. Disable the resource (in this example, named MyScript) by typing the following command:
    cluster resource "MyScript" /properties PersistentState=0
  2. Stop the Cluster service on the node that currently owns this resource's group by typing the following command in a console window:
    net stop clussvc
  3. Fix any problem that you identify in the script that causes it to stop responding, loop, or exceed the pending timeout value. You may determine that the appropriate thing to do is to increase the pending timeout value, but make sure that you carefully consider the implications of doing so.
  4. Restart the Cluster service by typing the following command:
    net start clussvc
  5. Bring the resource back online manually by using Cluster Administrator or Cluster.exe. To do so, type the following command:
    cluster resource "MyScript" /online
    Note that bringing the resource back online automatically sets PersistentState to 1, so there is no need for an additional command to change the value from 0.
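The command sequence above can be collected into a small helper, sketched here in Python. It uses only the commands given in the steps; the function names `recovery_commands` and `run_all` are hypothetical, and the script defaults to a dry run that prints each command instead of executing it. Step 3 (fixing the script) remains manual, so the commands are split into the part before the fix and the part after it.

```python
import subprocess


def recovery_commands(resource_name):
    """Return the commands from the steps above as argument lists,
    split around the manual script fix in step 3."""
    before_fix = [
        # Step 1: disable the resource.
        ["cluster", "resource", resource_name, "/properties",
         "PersistentState=0"],
        # Step 2: stop the Cluster service on the owning node.
        ["net", "stop", "clussvc"],
    ]
    after_fix = [
        # Step 4: restart the Cluster service.
        ["net", "start", "clussvc"],
        # Step 5: bring the resource back online.
        ["cluster", "resource", resource_name, "/online"],
    ]
    return before_fix, after_fix


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        # Printed form is for display only and does not add quoting.
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


before_fix, after_fix = recovery_commands("MyScript")
run_all(before_fix)   # dry run: prints steps 1 and 2
# ... fix the script (step 3) before continuing ...
run_all(after_fix)    # dry run: prints steps 4 and 5
```

Passing `dry_run=False` would execute the commands, which should only be done on the node that currently owns the resource's group.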



Status

Microsoft has confirmed that this is a bug in the Microsoft products that are listed at the beginning of this article.



Keywords: KB811685, kbbug


Article Info
Article ID : 811685
Revision : 7
Created on : 2/28/2007
Published on : 2/28/2007