Indexes and parallelism
Frequently, bursts of I/O occur because an index is missing. This behavior can severely push the I/O path. A pass that uses the Index Turning Wizard (ITW) may help resolve I/O pressure on the system. If a query benefits from an index instead of a table scan, or perhaps if it uses a sort or hash, the system can gain the following advantages:
- A reduction is made in the physical I/O that is required to complete the action that directly creates performance benefits for the query.
- Fewer pages in the data cache have to be turned over. Therefore, those pages that are in the data cache remain relevant to active queries.
- Sorts and hashes are used because an index may be missing or because statistics are out of date. You may reduce tempdb use and contention by adding one or more indexes.
- A reduction is made in resources, parallel operations, or both. Because SQL Server does not guarantee parallel query execution, and because the load on the system is considered, it is best to optimize all queries for serial execution. To optimize a query, open Query Analyzer and set the sp_configure value of the max degree of parallelism option to 1. If all the queries are tuned to run promptly as a serial operation, parallel execution is often just a better result. However, many times parallel execution is selected because the amount of data is just large. For a missing index, a large sort may have to occur. Multiple workers that are performing the sort operation will create a quicker response. However, this action can dramatically increase the pressure on the system. Large read requests from many workers can cause an I/O burst together with increased CPU usage from multiple workers. Many times a query can be tuned to run faster and to use fewer resources if an index is added or if another tuning action occurs.
Practical examples from Microsoft SQL Server Support
The following examples have been handled by Microsoft SQL Server Support and Platforms Escalation Support. These examples are intended to provide a frame of reference and help set your expectations about stalled and stuck I/O situations and about how a system may be affected or may respond. There is no specific hardware or set of drivers that pose any specific risk or increased risk over another. All systems are the same in this respect.
Example 1: A log write that is stuck for 45 seconds
An attempt to write a SQL Server log file periodically becomes stuck for approximately 45 seconds. The log write does not finish in a timely manner. This behavior creates a blocking condition that causes 30-second client time-outs.
The application submitted a commit to SQL Server, and the commit became stuck as a log write pending. This behavior causes the query to continue holding locks and to block incoming requests from other clients. Then, other clients start to time out. This compounds the problem because the application does not roll back open transactions when a query time-out occurs. This creates hundreds of open transactions that are holding locks. Therefore, a severe blocking situation occurs.
For more information about transaction handling and blocking, see the following Microsoft Knowledge Base article:
The application services a website by using connection pooling. As more connections become blocked, the website creates more connections. These connections become blocked, and the cycle continues.
After approximately 45 seconds, the log write finishes. However, by this time, hundreds of connections are backed up. The blocking problems cause several minutes of recovery time for SQL Server and for the application. Combined with the application problems, the stalled I/O condition has a very negative effect on the system.
Resolution
This problem was tracked to a stuck I/O request in a Host Bus Adapter (HBA) driver. The computer had multiple HBA cards with failover support. When one HBA was behind or was not communicating with the Storage Area Network (SAN), the "retry before failover" time-out value was configured to 45 seconds. When the time-out was exceeded, the I/O request was routed to the second HBA. The second HBA handled the request and quickly finished. To help prevent such stall conditions, the hardware manufacturer recommended a "retry before failover" setting of 5 seconds.
Example 2: Filter driver intervention
Many antivirus software programs and backup products use I/O filter drivers. These I/O filter drivers become part of the I/O request stack, and they have access to the IRP request. Microsoft Product Support Services has seen various issues from bugs that create stuck I/O conditions or stalled I/O conditions in a filter driver implementation.
One such condition was a filter driver for backup processing that allowed a backup of the files that were open when the backup occurred. The system administrator had included the SQL Server data file directory in the file backup selections. When the backup occurred, the backup tried to gather the correct image of the file at the time the backup started. Doing this delayed I/O requests. The I/O requests were allowed to finish only one at a time as they were handled by the software.
When the backup started, SQL Server performance dropped dramatically because the I/Os of SQL Server were forced to finish one at a time. To compound the issue, the "one at a time" logic was such that the I/O operation could not be performed asynchronously. Therefore, when SQL Server expected to post an I/O request and to continue, the worker was stuck in the read or the write call until the I/O request finished. Processing tasks such as a SQL Server read-ahead were effectively disabled by the actions of the filter driver. Additionally, another bug in the filter driver left the "one at a time" actions in process, even when the backup was finished. The only way to restore SQL Server performance was to close and then reopen the database, or to restart SQL Server so that the file handle was released and reacquired without the filter driver interaction.
Resolution
To resolve this problem, the SQL Server data files were removed from the file backup process. The software manufacture also corrected the problem that left the file in "one at a time" mode.
Example 3: Hidden errors
Many higher-end systems have multichannel I/O paths to handle load balancing or similar activities. Microsoft Product Support has found problems with the load balancing software where an I/O request fails but the software does not handle the error condition correctly. The software can attempt infinite retries. The I/O operation becomes stuck and SQL Server cannot finish the specified action. Much like the log write condition that was described earlier, many poor system behaviors can occur after such a condition wedges the system.
Resolution
To resolve this problem, restarting SQL Server is often required. However, sometimes you must restart the operating system to restore processing. We also recommend that you obtain a software update from the I/O vendor.
Example 4: Remote storage, mirroring, and raid drives
Many systems use mirroring or take similar steps to prevent data loss. Some systems that use mirroring are software-based and some are hardware-based. The situation that is typically discovered by Microsoft Support for these systems is increased latency.
An increase in the overall I/O time occurs when the I/O must finish to the mirror before the I/O is considered complete. For remote mirror installations, network retries can become involved. When drive failures occur and the raid system is rebuilding, the I/O pattern can also be interrupted.
Resolution
Strict configuration settings are required to reduce latency to mirrors or to raid rebuild operations.
For more information, review
the Requirements for SQL Server to support remote mirroring of user databases .
Example 5: Compression
Microsoft does not support Microsoft SQL Server 7.0 or Microsoft SQL Server 2000 data files and log files on compressed drives. NTFS compression is not safe for SQL Server because NTFS compression breaks Write Ahead Logging (WAL) protocol. NTFS compression also requires increased processing for each I/O operation. Compression creates "one at a time" like behavior that causes severe performance issues to occur.
Resolution
To resolve this problem, uncompress the data and the log files.
For more information, review
the Description of support for SQL Server databases on compressed volumes .
Additional data points
PAGEIOLATCH_* and writelog waits in sys.dm_os_wait_stats dynamic management views (DMV) are key indictors to investigate I/O path performance. If you see significant PAGEIOLATCH waits, this means that SQL Server is waiting on the I/O subsystem. A certain amount of PAGEIOLATCH waits is typical and expected behavior. However, if the average PAGEIOLATCH wait times are consistently greater than 10 milliseconds (ms), you should investigate why the I/O subsystem is under pressure. For more information, see the following documents: