This article discusses common link state issues and common
routing issues that you may experience in Microsoft Exchange 2000 Server and in
Microsoft Exchange Server 2003.
The purpose of a routing group
The routing group is the smallest unit of servers that are likely
to always be connected to one another. The routing group can be assumed to be
one node on the graph of connector paths, with multiple possible connectors
between routing groups.
To configure the way that messages are routed
between servers so that point-to-point connections between servers are always
made, the servers must be grouped in routing groups, and the Routing Group
connectors must be defined.
In a routing group, link state
information updates and routing information updates are pushed between master
nodes and member nodes through a persistent Transmission Control Protocol (TCP)
connection on port 691. Between two routing groups, the bridgehead servers advertise the
X-LINK2STATE verb and exchange link state information by comparing the MD5
digest of the Exchange organization information packet that each bridgehead holds.
A mismatch triggers an exchange of link state information between
the two servers through SMTP port 25.
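The digest comparison between two bridgeheads can be pictured with the following minimal Python sketch. It is an illustration only and does not reproduce the X-LINK2STATE wire format: the byte strings stand in for each bridgehead's organization information packet, and the function names are hypothetical.

```python
import hashlib


def org_info_digest(packet: bytes) -> str:
    """Return the MD5 digest (hex) of an organization information packet."""
    return hashlib.md5(packet).hexdigest()


def needs_link_state_exchange(local_packet: bytes, remote_digest: str) -> bool:
    """A digest mismatch means the two bridgeheads hold different
    link state information and have to exchange it."""
    return org_info_digest(local_packet) != remote_digest


# Hypothetical packets held by two routing group bridgeheads.
local = b"RG1:connector-A=UP;RG2:connector-B=UP"
remote = b"RG1:connector-A=UP;RG2:connector-B=DOWN"

print(needs_link_state_exchange(local, org_info_digest(local)))   # False
print(needs_link_state_exchange(local, org_info_digest(remote)))  # True
```

Comparing digests instead of full packets keeps the routine check small. The larger link state transfer happens only when the digests differ.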
The role of a routing group master
The routing group master coordinates changes to link states that
are learned by servers in its routing group and retrieves updates from the
directory service. By having a single server coordinate the changes, you can
treat a routing group as a single entity for the purposes of computing a
least-cost path between routing groups in an organization.
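To illustrate what "least-cost path between routing groups" means, the following Python sketch treats each routing group as one node and each connector as a weighted edge. It uses Dijkstra's algorithm as a standard way to find a least-cost path. The routing group names and connector costs are made up for the example.

```python
import heapq


def least_cost(graph, source, target):
    """Dijkstra's algorithm over routing groups.

    graph maps a routing group name to {neighbor: connector cost}.
    Returns (total cost, path), or (None, []) if target is unreachable.
    """
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, edge_cost in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + edge_cost, neighbor, path + [neighbor]))
    return None, []


# Hypothetical topology: three routing groups joined by connectors.
topology = {
    "RG-A": {"RG-B": 10, "RG-C": 50},
    "RG-B": {"RG-A": 10, "RG-C": 20},
    "RG-C": {"RG-A": 50, "RG-B": 20},
}

print(least_cost(topology, "RG-A", "RG-C"))  # (30, ['RG-A', 'RG-B', 'RG-C'])
```

In this example the two-hop route through RG-B (cost 30) is preferred over the direct connector (cost 50).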
What occurs when the routing group master stops responding
All servers in the routing group continue to operate on the same
information that they had at the time that they lost contact with the master.
When the routing group master comes back up, it examines the status
of all other servers, reconstructs the link state information, processes the
State Change Queue (SCQ), and then updates members in the routing group.
Common issues
The following sections present several routing issues that you
may experience. Additionally, the following sections suggest methods that you
can use to troubleshoot the issues.
Routing member node is not connected to master
When you use the WinRoute tool (Winroute.exe) to view Exchange
organization routing, you may see the words "connected to master - NO" and a
red X next to the organization's name. These words and the red X indicate that
the routing member node is not connected to the master.
In a routing group, the routing group nodes, including the
master, must be connected to the master node on Transmission Control Protocol
(TCP) port 691 to propagate routing information and link state information to
and from the master node.
Note The Microsoft Exchange Server 2003 WinRoute tool (Winroute.exe) for
troubleshooting routing in an Exchange 2000 and Exchange 2003 mail-handling
environment is available for download from the Microsoft Download Center.
For more information about how to
download Microsoft Support files, click the following article number to view
the article in the Microsoft Knowledge Base:
119591 How to obtain Microsoft support files from online services
Microsoft scanned this file for viruses. Microsoft used the most
current virus-detection software that was available on the date that the file
was posted. The file is stored on security-enhanced servers that help prevent
any unauthorized changes to the file.
To resolve this issue, follow these steps:
- Make sure that the Exchange Routing Engine Service (RESvc
service) is started on all affected servers in the routing group and that it
remains in a controlled state. If the service is in an unstable state, the
server may not connect to master nodes. Investigate the root cause of any
unstable services before you go to the next step.
- Verify that a firewall does not restrict TCP port 691. To
do this, initiate a Telnet session to port 691 on the affected servers and on
the master node. A Microsoft Routing Engine banner indicates an active state.
- At the command prompt, run the netstat -a -n command. The output of this command reveals all member nodes and
the master itself connecting to TCP port 691 on the master node.
- In Event Viewer, check the application logs for any events
that indicate a failure to authenticate by using the computer account, such as
Domain\serverName$. Events such as Transport events 962 and 961 indicate a
failure of the RESvc service to connect.
- Verify that the SendAs right is not missing or denied for the affected
servers or for the Exchange Domain Servers group that they belong to, and that
it is not denied through nested membership in another group. To do this, run the
Exchange Trace Utility (Regtrace.exe), and then restart the RESvc service.
For more information about
RegTrace setup on Exchange 2000, click the following article number to view the
article in the Microsoft Knowledge Base:
238614 How to set up Regtrace for Exchange 2000
Note For additional information about tools and processes that you can
use to troubleshoot and to diagnose transport issues and routing issues in
Exchange 2003, see the Exchange Server 2003 Transport and Routing Guide online book.
- Verify that the affected servers can generate a
ServicePrincipalName (SPN) for authentication. To verify this, check the
network address attribute of the affected servers by using the ADSI Edit tool
(ADSIEdit.exe) or by using the Lightweight Directory Access Protocol (LDAP) tool (Ldp.exe).
Nodes in a routing group have to mutually authenticate with the
routing group master to be connected. To do this, they use the ncacn_ip_tcp
value in the Network address attribute of the Exchange Server computer to
generate the SPN for the master node by using Kerberos authentication. Make
sure that this value is a fully qualified domain name (FQDN) instead of a
NetBIOS name or an IP address. Then, restart the RESvc service.
- Check the application log and the system log on all the
affected servers for any Kerberos authentication errors. Kerberos
authentication errors may be caused by an expired domain computer account
password. To gain additional information about this issue, run the NLTEST
utility with debug flags.
For more information about how to run the NLTEST
utility with debug flags, click the following article number to view the
article in the Microsoft Knowledge Base:
109626 Enabling debug logging for the Net Logon service
Important If the domain computer account password has apparently expired,
you must contact Microsoft Product Support Services (PSS) to confirm and to
correct the issue.
- Verify that the FQDN of the virtual server matches the FQDN
in Domain Name System (DNS).
- If the membership of the routing group spans multiple
domains, make sure that DNS is correctly designed and implemented between the
domains.
- Look for any third-party applications that use Group Policy
objects to restrict permissions or to restrict security settings.
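When you review the netstat -a -n output from the steps above, it can help to filter it down to the rows that involve TCP port 691. The following Python sketch does this for text that you have already captured. The sample output and addresses are invented for illustration, and the function name is hypothetical.

```python
def connections_on_port(netstat_output, port=691):
    """Return (local, foreign, state) for TCP rows that involve the port."""
    suffix = ":" + str(port)
    rows = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # A TCP row has four fields: protocol, local, foreign, state.
        if len(fields) != 4 or fields[0].upper() != "TCP":
            continue
        local, foreign, state = fields[1], fields[2], fields[3]
        if local.endswith(suffix) or foreign.endswith(suffix):
            rows.append((local, foreign, state))
    return rows


# Invented sample of "netstat -a -n" output captured on a master node.
sample = """
  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:25             0.0.0.0:0              LISTENING
  TCP    0.0.0.0:691            0.0.0.0:0              LISTENING
  TCP    10.0.0.5:691           10.0.0.6:3456          ESTABLISHED
  TCP    10.0.0.5:6910          10.0.0.7:4000          ESTABLISHED
"""

for row in connections_on_port(sample):
    print(row)
```

On the master node, you would expect one ESTABLISHED row on port 691 for each connected member. A member that is missing from the list is a candidate for the other checks in this section.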
Routing group master wars
In a routing group, the first server installed in the routing
group is automatically elected as the master node. As other servers are
installed, the administrator has the option to appoint another server as
master.
When the new routing group master is elected, only one server
should be assigned the master role at a time. This rule is enforced by an
algorithm that is based on the formula "(N/2) + 1", where N denotes the
number of servers in the routing group. The algorithm calculates the number
of servers in the routing
group that must agree and that must acknowledge the master. Therefore, the
member nodes send link state ATTACH data (information about the routing group)
to the master.
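As a worked example of the formula, the following Python sketch computes the number of servers that must acknowledge the master for a few routing group sizes. It assumes that N/2 is rounded down for odd values of N, which the article does not state explicitly. The function name is hypothetical.

```python
def acknowledgments_required(n: int) -> int:
    """Servers that must agree on the master: (N/2) + 1, with N/2 rounded down."""
    if n < 1:
        raise ValueError("a routing group has at least one server")
    return n // 2 + 1


for size in (1, 2, 5, 10):
    print(size, acknowledgments_required(size))
# 1 1
# 2 2
# 5 3
# 10 6
```

Under this assumption the result is always more than half of the servers, so two different servers cannot both gather enough acknowledgments at the same time.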
It is not uncommon for two or more servers to have
erroneous information about which server is the current routing group master.
For example, if a routing group master was moved or was deleted, and another
master node was not chosen, the MsExchRoutingMasterDN attribute may point to a
non-existent server.
This issue may also occur when an old master
does not detach as master, or when a problematic node keeps sending incorrect
link state ATTACH information.
Note In Microsoft Exchange Server 2003, if a routing group points to a
deleted object, the master node gives up its role as master and initiates a
shutdown.
To resolve this issue, use one of the following methods:
- Look for link state data propagation through TCP port 691,
for firewall hindrances such as firewall blocking of TCP port 691, and for SMTP
filters.
- Look for Active Directory replication latencies.
- Look for network problems and latencies.
- Look for deleted routing group masters or servers that no
longer exist. If this is the case, a Transport event 958 that references a
routing group master distinguished name that no longer exists is logged in the
application log. Use the Lightweight Directory Access Protocol (LDAP) tool (Ldp.exe) or the
ADSI Edit (Adsiedit.exe) tool to verify that this is the case.
Deleted routing groups are followed by [object_not_found_in_DS]
When servers are moved between routing groups, and the
routing groups are subsequently deleted, Winroute.exe may show
the text [object_not_found_in_DS] next to the object name.
This issue may occur if the routing engine
service tries to correlate an object that still exists in a dynamic routing
library that is maintained by the server with objects in Active Directory,
where the object does not exist any more.
Tips to resolve this issue:
- Restart all servers in the organization at the same time.
This action updates routing information. Additionally, this action removes
deleted routing groups and deleted connectors.
- Use the Remonitor.exe tool in injection mode.
Note Contact Microsoft Product Support Services for information about
the Remonitor.exe tool in injection mode.
- Make sure that the servers are on a recent build of
Exchange Server and that they have the Exchange Server service pack rollups
installed.
Note Applying the hotfix that is described in the following Knowledge
Base article is no longer necessary if your servers are on a recent build of
Exchange Server and have the current Exchange Server service pack rollups
installed. If you cannot install the most recent Exchange Server service pack
rollups, apply the hotfix that is described in the following Knowledge Base
article:
330279 Deleted routing groups are listed in the WinRoute tool; fix requires Exchange 2000 SP3
- Restart all Exchange Server services and Windows Management
Instrumentation (WMI) services on all Exchange Server computers in the
organization. This resolution is effective only if all servers are restarted at
the same time.
Note Contact Microsoft Product Support Services for information about
restarting all servers at the same time.
- Make sure that the account that is logged on to the server
has sufficient permissions. To do this, run Winroute.exe under the System
Account.
Note The lack of sufficient read permissions may cause Winroute.exe to
incorrectly report [object_not_found_in_DS].
Connectors are not reported to be marked as "DOWN"
When you use the Winroute.exe tool to view Exchange routing
topology, you may see that connectors that are unavailable are reported as
being available (that is, they are marked as "UP"). This behavior may occur for the
following connectors:
- Connectors that use DNS to route. For example, this
behavior may occur for SMTP connectors that use DNS instead of a smart
host.
- Microsoft Exchange 5.5 Server connectors or Exchange
Development Kit (EDK) connectors. These connectors do not use link state
routing.
- Routing group connectors with source bridgeheads of the
"any" type.
- Any connectors where one bridgehead is an Exchange 5.5
Server computer.
- Connectors that use smart host settings and recently
changed smart hosts.
Link state oscillations: connectors are repeatedly marked as "UP" and then as "DOWN"
This common scenario involves connectors being marked as "UP" and
then as "DOWN" repeatedly. It causes excessive link state updates between
servers. These excessive link state updates cause a very expensive and frequent
recalculation of routes within the server. This is also indicated by Event 4005
Reset Routes. This issue may occur in the following scenarios:
- Network problems. Use a network trace to diagnose this
scenario.
- A reaction to link status notification calls from
underlying protocol services, such as SMTP/AQ and message transfer agent (MTA).
This behavior is caused by interference on the X.400 protocol levels or on the
SMTP protocol levels by third-party applications.
In this scenario,
only a network monitor capture can reveal the issues that are involved.
Additionally, if you notice very frequent changes of the major versions, of the
minor versions, and of the user versions in the WinRoute tool, this may also
indicate a link state problem (see the WinRoute routing version changes section).
To reduce link state oscillations, apply the hotfix that is
described in the following article in the Microsoft Knowledge Base:
825314 Link state traffic saturates slow links between servers
After the hotfix has been applied, you must enable
the AttachedTimeout registry value to make sure that the hotfix works as
expected.
Important This section, method, or task contains steps that tell you how to modify the registry. However, serious problems might occur if you modify the registry incorrectly. Therefore, make sure that you follow these steps carefully. For added protection, back up the registry before you modify it. Then, you can restore the registry if a problem occurs. For more information about how to back up and restore the registry, click the following article number to view the article in the Microsoft Knowledge Base:
322756 How to back up and restore the registry in Windows
To enable the AttachedTimeout registry value,
follow these steps:
- Click Start, click Run, type regedit, and then click OK.
- Locate the HKLM\SYSTEM\CurrentControlSet\Services\RESvc\Parameters subkey.
- Right-click the Parameters subkey, point to New, and then click DWORD value.
- Name the new value AttachedTimeout.
- Double-click AttachedTimeout, click to select Decimal for the Base type, and
then type any data value from 1 to 604800.
Note The AttachedTimeout value represents time in seconds. The valid
range for this value is 1 second to 604,800 seconds (7 days).
- Click OK, and then quit Registry Editor.
Note Contact Microsoft Product Support Services for more information
about the AttachedTimeout registry value.
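If you prefer to prepare the change as a .reg file instead of editing the value by hand, the following Python sketch generates the file text and enforces the valid range. It is a sketch only. The function name is hypothetical, the value 3600 is an arbitrary example and not a recommended setting, and the same registry backup precautions apply before you import the result.

```python
# Full form of the HKLM path named in the steps; .reg files need the long root name.
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\RESvc\Parameters"
MAX_SECONDS = 604800  # 7 days, the upper bound of the valid range


def attached_timeout_reg(seconds: int) -> str:
    """Return .reg file text that sets AttachedTimeout to the given seconds."""
    if not 1 <= seconds <= MAX_SECONDS:
        raise ValueError("AttachedTimeout must be between 1 and 604800 seconds")
    return (
        "Windows Registry Editor Version 5.00\n\n"
        f"[{KEY}]\n"
        # .reg files store DWORD data as eight hexadecimal digits.
        f'"AttachedTimeout"=dword:{seconds:08x}\n'
    )


print(attached_timeout_reg(3600))  # 3600 seconds is written as dword:00000e10
```

Note that the .reg format stores the value in hexadecimal, whereas the Registry Editor steps enter it in decimal. The generator performs that conversion so that the stored number of seconds is the same.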
How connector states affect link states
A connector can be located anywhere in any routing group in the
Exchange organization. A specific connector that is frequently marked as "UP"
and as "DOWN" may seriously affect the possible routes that a message can take
through the organization. Such a connector may even lead to mail loops.
Exchange routing chooses the optimal path, based on variables
such as cost, message type, and restrictions. Exchange routing locates the next
server for a message to make the next hop to, and then Exchange routing gives
the name of the next server to Message Queuing. Because the oscillating state
of a connector causes link state changes, Exchange has to repeatedly
recalculate the optimal path. This recalculation process involves queries to
the directory service.
How link states affect connector states
When Message Queuing detects that a link to the bridgehead server
on a connector failed, it calls into routing by using a method that is named
LinkStateNotify(). Routing then suppresses this information for up to 10
minutes to prevent connector state fluctuation, and then routing relays this
information to the routing group master. If routing decides to mark the
connector as "DOWN," this change is propagated to all computers in the
organization, including the computer where the original failure occurred. This
behavior leads to a very expensive process that is named "reset routes."
Thereafter, the routing engine no longer recommends that the Advanced Queuing
engine (AQ) connect to the "failed" next-hop computer. The reverse is true for
a connector that is marked as "UP."
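The suppression step can be pictured as a hold-down timer: a reported change is relayed only if it is still in effect after the hold period. The following Python sketch shows that idea. It is not how the routing engine is implemented. The class, its method, and the exact timing rule are assumptions made for illustration, with the 10-minute figure taken from the text above.

```python
class HoldDown:
    """Relay a connector state change only after it has persisted."""

    def __init__(self, hold_seconds=600, initial="UP"):
        self.hold_seconds = hold_seconds
        self.reported = initial   # state last relayed to the master
        self.pending = None       # (state, time first seen) or None

    def notify(self, state, now):
        """Record a state seen at time `now`; return the state to relay, or None."""
        if state == self.reported:
            self.pending = None   # the change reverted before the hold expired
            return None
        if self.pending is None or self.pending[0] != state:
            self.pending = (state, now)
            return None
        if now - self.pending[1] >= self.hold_seconds:
            self.reported, self.pending = state, None
            return state
        return None


link = HoldDown()
print(link.notify("DOWN", 0))    # None: change noted, hold period starts
print(link.notify("UP", 120))    # None: link recovered, nothing is relayed
print(link.notify("DOWN", 300))  # None: new failure, hold period restarts
print(link.notify("DOWN", 900))  # DOWN: failure persisted for 600 seconds
```

A brief failure that clears within the hold period never reaches the routing group master, which is the kind of fluctuation that the suppression is meant to absorb.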
WinRoute routing version changes
The WinRoute tool reports routing versions in the following
format: "RoutingGroup (d5.2.3)". The three numbers that are separated by
periods that follow the routing group name are the major version, the minor
version, and the user version.
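If you record WinRoute labels over time, a small parser makes it easier to see which of the three versions is changing. The following Python sketch assumes that labels always follow the format shown above, including the literal "d" before the numbers. The function names are hypothetical.

```python
import re

# Matches labels such as "RoutingGroup (d5.2.3)".
VERSION_RE = re.compile(r"^(?P<name>.+?)\s*\(d(?P<major>\d+)\.(?P<minor>\d+)\.(?P<user>\d+)\)$")


def parse_routing_version(label):
    """Return (name, major, minor, user) parsed from a WinRoute label."""
    match = VERSION_RE.match(label.strip())
    if match is None:
        raise ValueError("unrecognized routing version label: " + repr(label))
    return (
        match.group("name"),
        int(match.group("major")),
        int(match.group("minor")),
        int(match.group("user")),
    )


def changed_versions(old_label, new_label):
    """Name the version components that differ between two labels."""
    old = parse_routing_version(old_label)
    new = parse_routing_version(new_label)
    names = ("major", "minor", "user")
    return [n for n, a, b in zip(names, old[1:], new[1:]) if a != b]


print(parse_routing_version("RoutingGroup (d5.2.3)"))  # ('RoutingGroup', 5, 2, 3)
print(changed_versions("RoutingGroup (d5.2.3)", "RoutingGroup (d5.4.3)"))  # ['minor']
```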
Major version changes are typically
changes in the directory service that involve routing and connectors. If the
major version changes frequently, monitor it by using the Remonitor.exe tool, and then
investigate the probable root cause. For example, an administrator may
make significant changes in the directory service. A major version of zero is shown
for isolated routing groups with no routing and no link state exchange with
other nodes. Additionally, a major version of zero is shown for Microsoft
Exchange 5.5 Server-based sites because they do not use link state information.
A minor version change may indicate changes to the state of a
connector. Frequent changes may be caused by faulty links or by links that
fluctuate between the "UP" state and the "DOWN" state. AQ tries to send a
message over a connector. If AQ fails, it sends a notification to routing to
mark the connector as "DOWN." Then, AQ initiates retry pings to the connector.
After AQ detects that the connector is up, AQ notifies routing by calling the
LinkStateNotify() method.
User version changes may occur in the
following situations:
- Servers attach to or detach from master nodes.
- WMI services send data to the routing group master.
- There is callback registration by routing clients such as
MTA or SMTP.
- There are routing group membership changes.
- You rename the routing group.
- A new master node is elected.
Base-level callbacks
Routing base-level callbacks are updates that occur after a
routing group object is modified, and after the updates are then propagated
throughout the organization. The Winroute.exe major version changes may be
triggered by the following events:
- Renaming a routing group
- Electing a new routing group master
- Removing a routing group member
- Adding a routing group member
One-level callbacks
One-level callbacks are typically updates to routing when changes
that are one level below the routing group object are detected. Some examples
are deleting a connector in the routing group and adding a connector to
the routing group.
DNS
Incorrect configuration of Domain Name System (DNS) may cause
several routing issues. These issues are addressed in the following
sections.
The DNS Resolver sink event on the SMTP virtual server
The DNS Resolver sink event is primarily for resolving external
SMTP domains. Your internal Active Directory servers and DNS servers still have
to be able to resolve all Exchange Server computers internally.
The SMTP virtual server DNS Resolver sink event is synchronous and can affect
performance on a heavily used server. To slightly improve the situation,
increase the number of threads that are used for DNS lookups.
The DNS Resolver sink event is used only when a server is not in the Exchange
organization. Exchange Server determines this by querying the Active Directory
directory service.
Windows 2000 DNS API
If you use the DNS Resolver tool for name resolution, the lookups
that are created by this tool are asynchronous and are much faster than using
the default settings of the external DNS Resolver sink event.
Exchange DNS that uses the Windows DNS API or the Exchange DNS Resolver sink
event has to be able to resolve an Internet Protocol address (IP address) in
the following ways:
- Mail exchanger resource record (MX record)-to-IP address
- MX record-to-A record-to-IP address
- MX record-to-CNAME record-to-A record-to-IP address
- CNAME record-to-A record-to-IP address
- A record-to-IP address
DNS records that are incorrectly configured, especially MX
records and CNAME records, may seriously affect mail flow.
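The resolution chains listed above can be traced with the following Python sketch, which walks an in-memory table of records instead of querying a real DNS server. The record table, the host names, and the function names are hypothetical. The limit of one CNAME hop is a choice made for this sketch and is not a statement about Exchange behavior.

```python
import ipaddress


def resolve_host(name, records, max_cname_hops=1):
    """Follow A-to-IP or CNAME-to-A-to-IP for one host name."""
    try:
        ipaddress.ip_address(name)
        return [name]  # the name is already an IP address
    except ValueError:
        pass
    hops = 0
    while True:
        addresses = records.get((name, "A"))
        if addresses:
            return list(addresses)
        alias = records.get((name, "CNAME"))
        if not alias or hops >= max_cname_hops:
            return []
        name = alias[0]
        hops += 1


def resolve_mail_domain(domain, records):
    """Resolve by MX records first; fall back to the domain name itself."""
    targets = records.get((domain, "MX")) or [domain]
    ips = []
    for target in targets:
        ips.extend(resolve_host(target, records))
    return ips


# Hypothetical zone data: MX record-to-CNAME record-to-A record-to-IP address.
RECORDS = {
    ("example.com", "MX"): ["mail.example.com"],
    ("mail.example.com", "CNAME"): ["mx1.example.com"],
    ("mx1.example.com", "A"): ["192.0.2.10"],
}

print(resolve_mail_domain("example.com", RECORDS))  # ['192.0.2.10']
```

An empty result from a sketch like this corresponds to a broken chain, for example an MX record that points to a name with no A record. That is the kind of misconfiguration that interrupts mail flow.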
Note Although Microsoft Exchange Server 2003 does provide limited
support for chained CNAME records, we do not recommend implementing this
configuration.
In Microsoft Exchange Server 2003, the external DNS
Resolver sink event has been improved. Additionally, you can use the DNS
Diagnostic utility (DNSdiag.exe) from the Windows Server 2003 Resource Kit to
troubleshoot DNS issues that involve the external SMTP resolver and the Windows
TCP/IP DNS. DNSdiag.exe shows the asynchronous queries and the synchronous
queries to global DNS servers or to the DNS servers that are called by the DNS
sink event. Additionally, DNSdiag.exe shows any corresponding failures or
errors.
Note The DNS Diagnostic utility is also known as the DNS
Resolver tool. They are the same file, DNSdiag.exe.
The Dnsdiag.exe package is available for download from the Microsoft
Download Center. For more information about how to
download Microsoft Support files, click the following article number to view
the article in the Microsoft Knowledge Base:
119591 How to obtain Microsoft support files from online services