The heartbeat process
The exchange of UDP datagrams between nodes in a cluster is known
as the "heartbeat process". By default, heartbeats are sent every 1.2 seconds
from each network interface for each node to each network interface for every
other node that is in the cluster. In Windows Server 2003, multicast datagrams
can be used to reduce the amount of heartbeat traffic that occurs between
cluster nodes. By default, Windows Server 2003 uses multicast datagrams when
three or more nodes are configured in a cluster.
Event IDs 1123 and 1122
Event ID 1123 indicates that node A in the cluster did not
receive a heartbeat from node B in the cluster for two heartbeat intervals over
a specified network interface. That means that node A did not receive a
heartbeat from node B for 2.4 seconds.
Event ID 1122 indicates that
node A received a heartbeat from node B. This communication update is received
after 2.4 seconds but before 4.8 seconds. Event ID 1122 is logged if
communications are re-established over a network interface that was previously
shut down. For example, event ID 1122 occurs when a node that was shut down
rejoins the cluster.
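The timing described above can be sketched in a few lines of Python. This is only an illustration of the thresholds stated in the text, not the logic of the cluster network driver. The function and constant names are hypothetical, and times are kept in whole milliseconds to avoid floating-point rounding.

```python
from typing import List

HEARTBEAT_INTERVAL_MS = 1200      # default interval from the text (1.2 seconds)
MISSED_INTERVALS_FOR_1123 = 2     # two intervals without a heartbeat (2.4 seconds)


def missed_intervals(silence_ms: int) -> int:
    """Whole heartbeat intervals that elapsed without a heartbeat."""
    return silence_ms // HEARTBEAT_INTERVAL_MS


def events_for_silence(silence_ms: int, heartbeat_resumed: bool) -> List[int]:
    """Event IDs that the text associates with one silent period on one interface."""
    events: List[int] = []
    if missed_intervals(silence_ms) >= MISSED_INTERVALS_FOR_1123:
        events.append(1123)          # heartbeat missed for two intervals
        if heartbeat_resumed:
            events.append(1122)      # communication re-established
    return events
```

For example, a 3-second gap that ends with a received heartbeat maps to event ID 1123 followed by event ID 1122, while a 1.2-second gap maps to no event.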
The regroup process
Assume that node A does not receive an update from node B after
six consecutive heartbeats over all network interfaces that are enabled for
internal cluster communications. In this case, node B is assumed to be
inactive. The cluster may perform a "regroup" process. During a regroup
process, the cluster network driver on node A notifies the Membership Manager
and the Node Manager that a failure has occurred. The Membership Manager and
the Node Manager initiate a regroup operation that takes node B offline and
removes it from active membership in the cluster. When this regroup process
occurs, event ID 1126 is logged in the System log. Event ID 1135 may be
subsequently logged in the System log. Event ID 1135 indicates that a node has
been removed from active cluster membership. Messages that are similar to the
following are logged:
Message 1
Event ID: 1126
Source: ClusSvc
Description:
The interface for cluster node
ClusterNode on network 'Public Network' is
unreachable by at least one other cluster node attached to the network. The
cluster was not able to determine the location of the failure. Look for
additional entries in the system event log indicating which other nodes have
lost communication with node ClusterNode. If the
condition persists, check the cable connecting the node to the network. Next,
check for hardware or software errors in the node's network adaptor. Finally,
check for failures in any other network components to which the node is
connected such as hubs, switches, or bridges.
Message 2
Event ID: 1135
Source: ClusSvc
Description:
Cluster node
ClusterNode was removed from the active cluster
membership. The Clustering Service may have been stopped on the node, the node
may have failed, or the node may have lost communication with the other active
cluster nodes.
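The condition that leads to a regroup can be expressed as a small check. The following Python sketch assumes a hypothetical dictionary that maps each network interface enabled for internal cluster communications to the number of consecutive heartbeats missed from node B. It illustrates the rule stated above and is not the Membership Manager algorithm.

```python
from typing import Dict

REGROUP_MISSED_HEARTBEATS = 6   # threshold stated in the text


def node_assumed_inactive(missed_by_interface: Dict[str, int]) -> bool:
    """True only if every internal interface missed six or more consecutive heartbeats."""
    if not missed_by_interface:
        return False  # no internal interfaces known, so nothing can be concluded
    return all(
        missed >= REGROUP_MISSED_HEARTBEATS
        for missed in missed_by_interface.values()
    )
```

A node that is silent on the private network but still reachable over another internal network does not meet this condition, which is why a single failed interface typically produces event IDs 1123 and 1126 rather than a regroup.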
Troubleshooting event IDs 1123 and 1122
When event ID 1123 is followed by event ID 1122, you can generally
ignore the events if the following conditions are true:
- There are no coincident failures of cluster IP address
resources, and there are no concurrent resource group failovers.
- The nodes that were removed from cluster membership were
removed only because of a loss of network communication. For example, a node
was removed when the node was shut down or restarted.
Note You can also generally ignore event IDs 1124, 1126, 1127, and
1130 if they occur during a node restart.
Important When event IDs 1126 and 1127 follow event ID 1123, a problem may
exist. Event IDs 1126 and 1127 indicate that all cluster nodes agree that a
network interface is not functioning correctly. In this case, messages that are
similar to the following are logged:
Message 1
Event ID: 1126
Source: ClusSvc
Description:
The interface for cluster node
ClusterNode on network 'Public Network' is
unreachable by at least one other cluster node attached to the network. The
cluster was not able to determine the location of the failure. Look for
additional entries in the system event log indicating which other nodes have
lost communication with node ClusterNode. If the
condition persists, check the cable connecting the node to the network. Next,
check for hardware or software errors in the node's network adapter. Finally,
check for failures in any other network components to which the node is
connected such as hubs, switches, or bridges.
Message 2
Event ID: 1127
Source: ClusSvc
Description:
The interface for cluster node
ClusterNode on network 'Public Network' failed. If
the condition persists, check the cable connecting the node to the network.
Next, check for hardware or software errors in node's network adapter. Finally,
check for failures in any network components to which the node is connected
such as hubs, switches, or bridges.
This section describes possible
reasons why you may receive event ID 1123 followed by event ID 1122. Use this
information to evaluate and to troubleshoot these events before you contact
Microsoft support.
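As a rough aid when you review the System log, the rules in this section can be written as a triage function. The Python sketch below is an assumption-laden illustration: the function name, the parameter names, and the returned labels are hypothetical, and the input is a chronological list of ClusSvc event IDs that you have already collected for one network interface.

```python
from typing import Sequence


def triage(events: Sequence[int],
           ip_resource_failed: bool = False,
           group_failover: bool = False,
           unexplained_removal: bool = False) -> str:
    """Classify a chronological list of event IDs by the rules in this section."""
    if 1123 not in events:
        return "not applicable"
    after_1123 = list(events)[list(events).index(1123) + 1:]
    # Event IDs 1126 and 1127 after 1123: all nodes agree the interface is faulty.
    if 1126 in after_1123 and 1127 in after_1123:
        return "investigate"
    if 1122 in after_1123:
        # Ignorable only if no coincident failures and removals are explained.
        if ip_resource_failed or group_failover or unexplained_removal:
            return "investigate"
        return "generally ignorable"
    return "inconclusive"
```

For example, `triage([1123, 1122])` returns "generally ignorable", while the same sequence with `group_failover=True` returns "investigate".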
Network adaptor teaming
Network adaptor teaming can involve a multi-port card or separate
single-port PCI network adaptors.
Note Network adaptor teaming is not supported on the cluster heartbeat
network adaptor.
The following articles discuss network adaptor
teaming with Windows Clustering:
254101 Network adaptor teaming and server clustering
276457 Event success messages 4201 and 1122 using Windows Clustering
Network adaptor driver issues
Network adaptor drivers may be outdated or incorrect.
Additionally, some drivers may not match the drivers on other nodes in the
cluster.
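One way to spot mismatched drivers is to compare a driver inventory across nodes. The following Python sketch assumes that you have already gathered the driver versions by some other means. The inventory layout (node name, then adaptor model, then driver version) and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def driver_mismatches(inventory: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Return adaptor models whose driver version differs between nodes."""
    versions = defaultdict(set)
    for drivers in inventory.values():
        for model, version in drivers.items():
            versions[model].add(version)
    # Keep only the models that were seen with more than one driver version.
    return {model: sorted(seen) for model, seen in versions.items() if len(seen) > 1}
```

An empty result means that every adaptor model reports the same driver version on all nodes in the inventory. It does not show whether that version is current.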
Network device failures
Network devices, such as switch ports or network adaptors, may not
be working correctly. However, if all cluster networks log the same error
message, a network device is unlikely to be the cause. If only one of the
cluster networks logs event IDs 1123 and 1122, you may have one of the
following problems:
- Device configuration mismatches
This problem occurs when settings for the network
adaptor and for the port that the node is attached to do not match. For
example, this problem occurs when a network adaptor is set to Auto Negotiate
and when the switch port is set to 100 megabits per second (Mbps) full-duplex.
Additionally, some network adaptors take over some of the functionality of the
TCP/IP stack. For example, some network adaptors perform flow control and
hardware checksumming. As part of the troubleshooting process, you may have to
configure the network adaptor to return this functionality to the TCP/IP stack.
For more information,
click the following article number to view the article in the Microsoft
Knowledge Base: 174812
The effects of using Autodetect setting on cluster network interface card
- Switch port issues
This problem is identified by connecting the cluster
node to another port. If you connect the node to another port and if event IDs
1123 and 1122 are not repeated, the problem is with the switch port. To
identify this problem, you can also plug the cluster nodes into a network hub
and then uplink the hub to the switch port. Use this method when the following
conditions are true:
- The public network is supported by a switch.
- The private, or heartbeat, network is supported by
either a hub or a crossover cable.
- Switch configuration issues
This problem occurs when the spanning tree protocol
(STP) has been enabled on the port and when the port is no longer in the
forwarding state. Disable this configuration, or enable the rapid spanning tree
protocol (RSTP) if the switch supports it. RSTP reduces the time that the
switch port must use to transition from a blocking state to a forwarding
state.
- Virtual local area network (VLAN) issues
This problem occurs when the cluster nodes are part of a
VLAN where the ports reside on different physical switches and when a trunk
link configuration is set up between the switches. To resolve this issue, move
the node connection to a port that is on the same physical switch.
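The device configuration mismatch described in the first item of this list can be illustrated with a small comparison. The Python sketch below uses a hypothetical `LinkSetting` record for the network adaptor and for the switch port. It models only the settings named in the text (auto-negotiation, speed, and duplex) and assumes that you read the values from the adaptor properties and the switch configuration yourself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LinkSetting:
    auto_negotiate: bool
    speed_mbps: Optional[int] = None     # only meaningful when forced
    full_duplex: Optional[bool] = None   # only meaningful when forced


def settings_match(adaptor: LinkSetting, switch_port: LinkSetting) -> bool:
    """True if both ends auto-negotiate, or both are forced to the same values."""
    if adaptor.auto_negotiate != switch_port.auto_negotiate:
        return False  # one end negotiates while the other is forced
    if adaptor.auto_negotiate:
        return True
    return (adaptor.speed_mbps, adaptor.full_duplex) == (
        switch_port.speed_mbps,
        switch_port.full_duplex,
    )
```

With these definitions, an adaptor set to Auto Negotiate and a switch port forced to 100 Mbps full-duplex is reported as a mismatch, which is the example given in the list.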
Node resource issues
A node resource problem occurs because the Server service cannot
keep up with incoming or outgoing network connections. The Server service
cannot meet the demand for the network items that are queued by the network
layer of the I/O stream. In this case, a Server service event, such as event ID
2022, may be logged in the System log. A message that is similar to the
following is logged:
Event ID:
2022
Source: Srv
Description:
Server was unable to find a free
connection n times in the last
NumberofSeconds seconds.
In this
situation, deferred procedure call (DPC) requests are queued ahead of the
network requests that are registered by the Interrupt Service Routine (ISR) for
the network device. To troubleshoot this issue, investigate all components of
the I/O path. This includes the network I/O and hard disk I/O. Use System
Monitor to collect this data. For more information about how to troubleshoot
this issue, click the following article number to view the article in the
Microsoft Knowledge Base:
317249
How to troubleshoot event ID 2021 and event ID 2022
DPC requests that occur in a cluster are typically
caused by the following sources:
- SCSI host bus adaptor (SCSI/HBA) or network adaptor
drivers.
- Multi-path software drivers, such as PowerPath or
SecurePath.
- Redundant disk array controllers (RDAC).
- Third-party programs, such as backup software or disk quota
software.
You must make sure that all third-party hardware device drivers
are current and that they are supported in a cluster configuration by the
hardware vendor. Additionally, we recommend that you contact your third-party
program vendor for more information about how the third-party software
functions in a clustered environment.
For more information,
click the following article number to view the article in the Microsoft
Knowledge Base:
814607
Microsoft support for server clusters with 3rd party system components
Incorrect software on Windows 2000-based cluster nodes
In a Windows 2000-based cluster, all cluster nodes must be running
Windows 2000 Service Pack 3 or a later version. If your Windows 2000-based
computer logged event ID 2022, view the following articles in the Microsoft
Knowledge Base to resolve this issue:
317249 How to troubleshoot event ID 2021 and event ID 2022
245080 Receiving multiple instances of event ID 2022
If you still experience this issue, apply the hotfix
that is described in the following article in the Microsoft Knowledge Base:
830901 Event ID 2022 is logged and your Windows 2000-based computer may stop responding
Incorrect registry settings
Important This section, method, or task contains steps that tell you how to modify the registry. However, serious problems might occur if you modify the registry incorrectly. Therefore, make sure that you follow these steps carefully. For added protection, back up the registry before you modify it. Then, you can restore the registry if a problem occurs. For more information about how to back up and restore the registry, click the following article number to view the article in the Microsoft Knowledge Base:
322756 How to back up and restore the registry in Windows
To resolve event messages in Windows 2000-based
and Windows Server 2003-based clusters, you may have to make changes to the
following registry subkey on each node:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters
Important In Windows 2000-based clusters, you must install hotfix 872790
before you make these changes.
Add the following DWORD values to the
registry subkey:
Value Name: MaxRawWorkItems
Data Type: REG_DWORD
Value data: 512 (decimal)
Value Name: MaxFreeConnections
Data Type: REG_DWORD
Value data: 4096 (decimal)
Value Name: MinFreeConnections
Data Type: REG_DWORD
Value data: 100 (decimal)
Value Name: MaxWorkItems
Data Type: REG_DWORD
Value data: 6000 (decimal)
To create the MaxWorkItems DWORD value, follow these steps:
- Click Start, click Run, type regedit, and then click OK.
- Locate and then click the following registry subkey:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters
- Right-click parameters, point to New, and then click DWORD Value.
- Type MaxWorkItems, and then press ENTER.
- Right-click MaxWorkItems, click Modify, type 6000, click to select the
Decimal option, and then click OK.
Repeat these steps for each new DWORD value, and then restart
your computer.
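If you prefer to prepare the four values as a registration (.reg) file instead of typing them in Registry Editor, the following Python sketch builds the file text. It is a convenience illustration only: the function name is hypothetical, the script does not touch the registry, and the backup and hotfix requirements described above still apply before you import anything. In a .reg file, DWORD data is written as eight hexadecimal digits, so the decimal values listed above are converted.

```python
SUBKEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters"

# Decimal values exactly as listed in this section.
VALUES = {
    "MaxRawWorkItems": 512,
    "MaxFreeConnections": 4096,
    "MinFreeConnections": 100,
    "MaxWorkItems": 6000,
}


def build_reg_text(subkey: str, values: dict) -> str:
    """Return .reg file text that sets each name to a REG_DWORD value."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{subkey}]"]
    for name, value in values.items():
        lines.append(f'"{name}"=dword:{value:08x}')  # hexadecimal, zero-padded
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    print(build_reg_text(SUBKEY, VALUES))
```

Review the generated text before you save and import it on each node, and restart the computer afterward as described in the steps above.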
High kernel-mode CPU usage
To troubleshoot high kernel-mode CPU usage, use System Monitor to
identify the problem. High kernel-mode CPU usage may be caused by the following
sources:
- Hardware drivers that use DPC and that compete with the DPC
routines of the cluster heartbeat process.
- Frequent multiple hardware interrupt requests that occur at
the same time.
- Excessive I/O, such as kernel debug sessions over a
serial connection.
High CPU usage that is caused by SNMP agents
Third-party Simple Network Management Protocol (SNMP) agents that
run in a cluster may periodically contact the NTFS file system on a shared
cluster disk resource. The agents use the
CreateFile function to contact NTFS. This behavior can cause significant CPU
usage when the SNMP agent caches data on a specific volume.
Multicast issues
Multicast issues may occur in a Windows Server 2003 cluster. To
troubleshoot multicast issues, disable multicast support in the
cluster. For
more information about how to disable multicast, click the following article
number to view the article in the Microsoft Knowledge Base:
307962
Multicast support enabled for the cluster heartbeat