SQL Server Optimizations for High Concurrency


From the moment SQL Server receives a query request, the work alternates between running and waiting, and the time spent waiting reveals, to some extent, how much pressure the system is under. When contention for a resource is severe, that resource becomes a performance bottleneck. Wait monitoring is therefore very helpful both for diagnosing system performance and for tuning individual queries. An occasional unusual wait is not enough to indicate a bottleneck, but if a SQL Server instance frequently shows a specific wait type and its wait time keeps growing, some resource is under pressure: memory, I/O, and so on. Based on the wait type, you can monitor and diagnose the system and tune the queries involved. For example, lock waits indicate data contention between executing queries, and PAGELATCH waits indicate that the file layout needs to be improved.
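As a rough illustration of "the wait time tends to increase", the sketch below compares two snapshots of cumulative wait times (such as the wait_type and wait_time_ms columns of sys.dm_os_wait_stats) and ranks the wait types by how much they grew. The function name and the sample numbers are hypothetical; this is only a sketch of the comparison, not a monitoring tool.

```python
def growing_waits(before, after, min_increase_ms=0):
    """Return (wait_type, increase_ms) pairs, largest increase first.

    `before` and `after` map a wait type name to its cumulative wait time
    in milliseconds at two points in time.
    """
    deltas = []
    for wait_type, after_ms in after.items():
        increase = after_ms - before.get(wait_type, 0)
        if increase > min_increase_ms:
            deltas.append((wait_type, increase))
    return sorted(deltas, key=lambda item: item[1], reverse=True)


# Hypothetical cumulative wait times sampled a few minutes apart.
before = {"LCK_M_X": 1200, "PAGELATCH_EX": 400, "PAGEIOLATCH_SH": 9000}
after = {"LCK_M_X": 1250, "PAGELATCH_EX": 5400, "PAGEIOLATCH_SH": 9100}

print(growing_waits(before, after))
```

Because the counters are cumulative, the difference between two samples is what shows which wait type is currently under pressure; a single large total may simply reflect a long uptime.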

Deadlocks

When a deadlock occurs, SQL Server chooses the lower-priority transaction as the victim and releases its locks so that the other transaction can continue; as a result, catching a deadlock while it is happening is not easy.

If sessions remain blocked on a lock, however, the blocking session can be captured.

You can use the SQL Server Activity Monitor to see which processes hold locks on the database.

First open the Activity Monitor. Its interface contains four panes: Processes, Resource Waits, Data File I/O, and Recent Expensive Queries.

Resource Waits: shows waits on resources, including locks, so you can see which lock types have accumulated the most wait time.

Data File I/O: shows the read and write rates of the database MDF and LDF files.

Recent Expensive Queries: lists the SQL queries that have recently consumed the most resources.

Find the session that is holding the lock, then run the SQL below to kill it and release the lock.

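-- Kill the session that is holding the lock
DECLARE @spid int;
SET @spid = 518; -- session ID of the locking process
DECLARE @sql varchar(1000);
SET @sql = 'kill ' + CAST(@spid AS varchar);
EXEC (@sql);

-- List the sessions that hold or request object-level locks, and the tables involved
SELECT request_session_id AS spid, OBJECT_NAME(resource_associated_entity_id) AS tableName
FROM sys.dm_tran_locks
WHERE resource_type = 'OBJECT';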

The queries that have recently consumed the most resources can also be retrieved with SQL.

SELECT TOP 10
    deqs.total_worker_time / 1000 AS [Total CPU time since compilation (ms)],
    deqs.total_elapsed_time / 1000 AS [Total elapsed time for this plan (ms)],
    deqs.total_elapsed_time / deqs.execution_count / 1000 AS [Average elapsed time per execution (ms)],
    deqs.execution_count AS [Executions since last compile],
    deqs.creation_time AS [Plan compile time],
    deqs.total_worker_time / deqs.execution_count / 1000 AS [Average CPU time (ms)],
    deqs.last_execution_time AS [Last execution start time],
    deqs.total_physical_reads AS [Total physical reads],
    deqs.total_logical_reads / deqs.execution_count AS [Average logical reads],
    deqs.min_worker_time / 1000 AS [Minimum CPU time for a single execution (ms)],
    deqs.max_worker_time / 1000 AS [Maximum CPU time for a single execution (ms)],
    SUBSTRING(dest.text, deqs.statement_start_offset / 2 + 1,
        (CASE WHEN deqs.statement_end_offset = -1 THEN DATALENGTH(dest.text) ELSE deqs.statement_end_offset END - deqs.statement_start_offset) / 2 + 1) AS [Executed SQL],
    dest.text AS [Complete SQL],
    DB_NAME(dest.dbid) AS [Database name],
    OBJECT_NAME(dest.objectid, dest.dbid) AS [Object name],
    deqs.plan_handle AS [Plan handle of the compiled plan]
FROM sys.dm_exec_query_stats deqs WITH (NOLOCK)
    CROSS APPLY sys.dm_exec_sql_text(deqs.sql_handle) AS dest
-- Sort by average CPU time, highest first
ORDER BY (deqs.total_worker_time / deqs.execution_count / 1000) DESC

In the SQL Server Activity Monitor, look at the Resource Waits pane.

The wait category at the top of the list is usually Latch. Latch statistics can be queried as follows:

SELECT * FROM sys.dm_os_latch_stats

For each latch class, the result set shows the number of wait requests, the total wait time, and the maximum wait time (in milliseconds).

The latch class names are terse English identifiers, so you need to look them up in the reference table to find out what they actually mean.

From the reference table, we find that the most expensive latch, ACCESS_METHODS_DATASET_PARENT, is used to synchronize child dataset access to the parent dataset during parallel operations. Reducing the number of concurrent operations therefore reduces the resources consumed by the ACCESS_METHODS_DATASET_PARENT latch.

Latch reference: https://docs.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-os-latch-stats-transact-sql?view=sql-server-2017

DBCC (Database Console Commands) statements are grouped into the following categories.

Maintenance: Maintenance tasks on a database, index, or filegroup.

Miscellaneous: Miscellaneous tasks, such as enabling trace flags or removing a DLL from memory.

Informational: Tasks that gather and display various types of information.

Validation: Validation operations on a database, table, index, catalog, filegroup, or allocation of database pages.

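For example, the following command shrinks a database, leaving 1 percent free space:

DBCC SHRINKDATABASE (N'database_name', 1)

The command returns a result set with the following columns.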

DbId: Database identification number of the file the Database Engine tried to shrink.

FileId: File identification number of the file the Database Engine tried to shrink.

CurrentSize: Number of 8 KB pages the file currently occupies.

MinimumSize: Number of 8 KB pages the file could occupy, at minimum. This corresponds to the minimum size or the originally created size of the file.

UsedPages: Number of 8 KB pages currently used by the file.

EstimatedPages: Number of 8 KB pages that the Database Engine estimates the file could be shrunk down to.

If the shrink is unsuccessful, check whether there is any free space left to reclaim.

SELECT name ,size/128.0 - CAST(FILEPROPERTY(name, 'SpaceUsed') AS int)/128.0 AS AvailableSpaceInMB
FROM sys.database_files;
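The division by 128.0 in the query above converts a count of 8 KB pages into megabytes (128 pages of 8 KB make 1 MB). A minimal sketch of the same arithmetic, with a hypothetical function name and made-up page counts:

```python
PAGE_SIZE_KB = 8
PAGES_PER_MB = 1024 // PAGE_SIZE_KB  # 128 pages of 8 KB per megabyte


def available_space_mb(size_pages, used_pages):
    """Free space in MB for a file of `size_pages` pages with `used_pages` in use."""
    return size_pages / PAGES_PER_MB - used_pages / PAGES_PER_MB


# Hypothetical file: 12800 pages allocated (100 MB), 6400 pages used (50 MB).
print(available_space_mb(12800, 6400))
```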

If there is free space but the shrink still does not succeed, there may be a legitimate reason for it.

DBCC reference: https://docs.microsoft.com/zh-cn/sql/t-sql/database-console-commands/dbcc-shrinkdatabase-transact-sql?view=sql-server-2017

The SQL Server transaction log does not store the text of the SQL statements that were executed; it stores binary log records, so the original statements cannot be read from the log directly.

If you want to view the SQL statements, you need a tool such as ApexSQL Log.

Even though you cannot see the SQL statements, the log can still reveal some database issues. For example, you can see how many insert, update, and other operations were performed.

SELECT * FROM [sys].[fn_dblog](NULL,NULL)


In a multithreaded process, what happens when one thread updates a data or index page in memory while a second thread is reading the same page?

What happens when one thread reads a data or index page in memory while a second thread is freeing the same page from memory?

Answer: We would end up with inconsistent data or data structures. To avoid this, SQL Server uses synchronization mechanisms such as locks, latches, and spinlocks.

This post covers a few key points about latches and how to debug latch timeout dumps.

What is a latch?

Types of latches

Buffer (BUF) Latch

Used to synchronize access to BUF structures and their associated database pages.

Buffer “IO” Latch

A subset of BUF latches used when the BUF and its associated data or index page are in the middle of an I/O operation (reading a page from disk or writing a page to disk).

Non-Buffer (Non-BUF) Latch

These latches synchronize general in-memory data structures, typically those used by queries or tasks executed by parallel threads, auto-grow operations, shrink operations, and so on.

Keep (KP) Latches

Used to ensure that the page is not released from memory while it is in use. 

Shared (SH) Latches

Used for read-only access to data structures; prevents write access by other threads.

SH is compatible with KP, SH, and UP.  It should be noted that although in general SH implies read-only access, it is not always the case. For buffer latches SH is the minimum mode required in order to read a data page.

Update (UP) Latches

Allows read access to the data structure (compatible with SH and KP), but prevents EX-latch access.

Used for write operations when torn page detection is off and when AWE is not enabled.

Exclusive (EX) Latches

Prevents any read activity from occurring on the latched structure. EX is only compatible with KP.

Used during read I/O, and during write I/O when torn page detection is on or AWE is enabled.

Destroy (DT) Latches

Used when removing BUFs from the buffer pool, either by adding them to the free list or unmapping AWE buffers. 
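The compatibility rules described above can be collected into a small lookup table. The sketch below encodes the SH, UP, and EX rows as stated in the text; the KP and DT rows (KP compatible with everything except DT, DT compatible with nothing) are an assumption that follows from the stated rules, and the function name is hypothetical.

```python
# For each requested latch mode: the held modes it can coexist with.
COMPATIBLE = {
    "KP": {"KP", "SH", "UP", "EX"},  # assumption: everything except DT
    "SH": {"KP", "SH", "UP"},        # stated: compatible with KP, SH, and UP
    "UP": {"KP", "SH"},              # stated: compatible with SH and KP
    "EX": {"KP"},                    # stated: only compatible with KP
    "DT": set(),                     # assumption: compatible with nothing
}


def can_grant(requested, held_modes):
    """True if `requested` is compatible with every mode already held."""
    return all(held in COMPATIBLE[requested] for held in held_modes)


print(can_grant("SH", ["SH", "KP"]))  # readers can share a page
print(can_grant("EX", ["SH"]))        # a writer must wait for readers
```

A request that cannot be granted immediately has to wait, which is what shows up as the latch wait types discussed next.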

How do you identify Latch contention?

Latch contention can be identified by the following wait types in sysprocesses.

PAGEIOLATCH_*: This wait type in sysprocesses indicates that SQL Server is waiting for a physical I/O of a buffer pool page to complete.

1. PAGEIOLATCH_* waits are commonly resolved by tuning the queries that perform heavy I/O (usually by adding, changing, or removing indexes or statistics to reduce the amount of physical I/O).

2. Identify whether there is a disk bottleneck (for example, PAGEIOLATCH wait times above 30 ms) and fix it.
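To make the 30 ms rule of thumb concrete, the following sketch derives an average wait per request from a cumulative wait time and a request count, and compares it with the threshold. The function names and sample figures are hypothetical, and the threshold is simply the value quoted above, not a universal limit.

```python
def average_wait_ms(wait_time_ms, waiting_tasks_count):
    """Average wait per request in milliseconds (0.0 when nothing waited)."""
    if waiting_tasks_count == 0:
        return 0.0
    return wait_time_ms / waiting_tasks_count


def looks_like_disk_bottleneck(wait_time_ms, waiting_tasks_count, threshold_ms=30):
    """True when the average PAGEIOLATCH wait exceeds the threshold."""
    return average_wait_ms(wait_time_ms, waiting_tasks_count) > threshold_ms


# Hypothetical counters: 90 s of PAGEIOLATCH_SH waits spread over 2000 requests.
print(average_wait_ms(90000, 2000))
print(looks_like_disk_bottleneck(90000, 2000))
```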

PAGELATCH_*: This wait type in sysprocesses indicates that SQL Server is waiting for access to a database page, but the page is not undergoing physical I/O.

1. This problem is normally caused by a large number of sessions attempting to access the same physical page at the same time. Look at the wait resource of the spid: the wait_resource is the page that is being accessed, in the format dbid:file:pageno.

2. Use DBCC PAGE to identify the object or type of page on which the contention occurs. It also helps determine whether the contention is on an allocation, data, or text page.

3. If the pages that SQL Server most frequently waits on are in the tempdb database, check the wait resource column for a page number in dbid 2. You may be facing tempdb allocation latch contention.

... clustered index key to spread the work across different pages.
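The checks in the list above can be sketched in a few lines: split a wait resource of the form dbid:file:pageno, note whether it is in tempdb (dbid 2), and flag the well-known allocation pages. The page intervals used here (PFS at page 1 and every 8,088 pages, GAM at page 2 and every 511,232 pages, SGAM at page 3 and one page after each GAM) come from the SQL Server page architecture rather than from this post, and the function names are hypothetical. DBCC PAGE remains the way to confirm what a page actually contains.

```python
PFS_INTERVAL = 8088      # a PFS page at page 1, then every 8,088 pages
GAM_INTERVAL = 511232    # a GAM page at page 2, then every 511,232 pages


def page_kind(page_no):
    """Classify a page number as an allocation page (PFS/GAM/SGAM) or other."""
    if page_no == 0:
        return "other"  # page 0 is the file header, not an allocation page
    if page_no == 1 or page_no % PFS_INTERVAL == 0:
        return "PFS"
    if page_no == 2 or page_no % GAM_INTERVAL == 0:
        return "GAM"
    if page_no == 3 or (page_no - 1) % GAM_INTERVAL == 0:
        return "SGAM"
    return "other"


def describe_wait_resource(wait_resource):
    """Parse 'dbid:file:pageno' and summarize what the page is likely to be."""
    dbid, file_id, page_no = (int(part) for part in wait_resource.strip().split(":"))
    return {
        "dbid": dbid,
        "file_id": file_id,
        "page_no": page_no,
        "in_tempdb": dbid == 2,
        "page_kind": page_kind(page_no),
    }


print(describe_wait_resource("2:1:1"))
print(describe_wait_resource("5:1:6088"))
```

Waits concentrated on PFS, GAM, or SGAM pages in dbid 2 point to allocation contention, while waits on ordinary pages point to a hot data or index page.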

LATCH_*: Non-buffer latch waits can be caused by a variety of things. Use the wait resource column in sysprocesses to determine the type of latch involved (KB 822101).

2.       Auto Grow and auto shrink.

When a thread requests a latch that cannot be granted immediately because another thread holds an incompatible latch on the same page or data structure, the requestor must wait until the latch can be granted. If the wait reaches 5 minutes, a warning message like the ones below is written to the SQL Server error log and a mini dump of all threads is captured. The warning message differs for buffer and non-buffer latches.

844: Time out occurred while waiting for buffer latch -- type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Continuing to wait.

846: A time-out occurred while waiting for buffer latch -- type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit Id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Not continuing to wait.

847: Timeout occurred while waiting for latch: class '%ls', id %p, type %d, Task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Continuing to wait.

Breakdown of the above warning

task

The task for which we are trying to acquire the latch.

waittime

The total time waited for this latch acquire request, in seconds.

owning task

The address of the task that owns the latch, if available.

bp (Buffer latches only)

The address of the BUF structure corresponding to this buffer latch.

page (Buffer latches only.)

The page id for the page currently contained in the BUF structure.

database id (Buffer latches only.)

The database id for the page in the BUF.

When there is a latch timeout dump, you will see a warning message similar to the one below. The warning printed in the SQL Server error log before the dump is essential for finding the thread that owns the latch.

2012-01-18 00:52:03.16 spid69      A time-out occurred while waiting for buffer latch -- type 4, bp 00000000ECFDAA00, page 1:6088, stat 0x4c1010f, database id: 4, allocation unit Id: 72057594043367424, task 0x0000000006E096D8 : 0, waittime 300, flags 0x19, owning task 0x0000000006E08328. Not continuing to wait.
spid21s     **Dump thread - spid = 21, PSS = 0x0000000094622B60, EC = 0x0000000094622B70
spid21s     ***Stack Dump being sent to E:\Data\Disk1\MSSQL.1\MSSQL\LOG\SQLDump0009.txt
spid21s     * *******************************************************************************
spid21s     * BEGIN STACK DUMP:
spid21s     *   02/28/12 00:32:03 spid 21
spid21s     * Latch timeout

Timeout occurred while waiting for latch: class ‘ACCESS_METHODS_HOBT_COUNT’, id 00000002D8C32E70, type 2, Task 0x00000000008FCBC8 : 7, waittime 300, flags 0x1a, owning task 0x00000000050E1288. Continuing to wait.

Timeout occurred while waiting for latch: class ‘ACCESS_METHODS_HOBT_VIRTUAL_ROOT’, id 00000002D8C32E70, type 2, Task 0x00000000008FCBC8 : 7, waittime 300, flags 0x1a, owning task 0x00000000050E1288. Continuing to wait.

From the error message above, we can see that we were trying to acquire a latch on page 1:6088 (page 6088 of the first file) in database id 4, and that the request timed out because task 0x0000000006E08328 (the owning task in the warning message) is holding a latch on that page.
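Reading these fields by eye is error-prone when the error log contains many such warnings. The sketch below pulls the page, database id, waiting task, wait time, and owning task out of a buffer latch timeout message with regular expressions. The function and field names are hypothetical, and the patterns are written against the message text shown in this post, so treat them as an assumption for other versions.

```python
import re

# One pattern per field; the waiting task is any "task 0x..." that is not
# preceded by the word "owning".
PATTERNS = {
    "page": r"page (\d+:\d+)",
    "database_id": r"database id: (\d+)",
    "waiting_task": r"(?<!owning )task (0x[0-9A-Fa-f]+)",
    "waittime_s": r"waittime (\d+)",
    "owning_task": r"owning task (0x[0-9A-Fa-f]+)",
}


def parse_latch_timeout(message):
    """Return the fields found in a latch timeout warning (None when absent)."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, message, flags=re.IGNORECASE)
        fields[name] = match.group(1) if match else None
    return fields


sample = (
    "type 4, bp 00000000ECFDAA00, page 1:6088, stat 0x4c1010f, database id: 4, "
    "allocation unit Id: 72057594043367424, task 0x0000000006E096D8 : 0, "
    "waittime 300, flags 0x19, owning task 0x0000000006E08328. Not continuing to wait."
)
print(parse_latch_timeout(sample))
```

The owning task address extracted this way is the value to search for in the dump, as described in the steps that follow.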

A task is simply a work request to be performed by a thread (such as a system task, login task, or ghost cleanup task). The thread that executes a task takes the latches it needs.

Let us see how to analyze a latch timeout dump and find the thread that owns the latch by using the owning task.

To analyze the dump, download and install the Windows Debugger (WinDbg).

Open WinDbg. From the File menu, select Open Crash Dump, then select the dump file (SQLDump000#.mdmp).

In the command window, type:
.sympath srv*c:\Websymbols*http://msdl.microsoft.com/download/symbols;

Next, type the symbol reload command and press Enter. This forces the debugger to load all the symbols immediately.

Verify that the symbols for SQL Server are loaded by using the debugger command lmvm.

0:002> lmvm sqlservr
start             end                 module name
00000000`01000000 00000000`03679000   sqlservr T (pdb symbols)          c:\websymbols\sqlservr.pdb\21E4AC6E96294A529C9D99826B5A7C032\sqlservr.pdb
    Loaded symbol image file: sqlservr.exe
    Image path: C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Binn\sqlservr.exe
    Image name: sqlservr.exe
    Timestamp:        Wed Oct 07 21:15:52 2009 (4ACD6778)
    CheckSum:         025FEB5E
    ImageSize:        02679000
    File version:     2005.90.4266.0
    Product version:  9.0.4266.0
    File flags:       0 (Mask 3F)
    File OS:          40000 NT Base
    File type:        1.0 App
    File date:        00000000.00000000
    Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4

Next, search the thread stacks for the thread that references the owning task; that thread is the one that owns the latch. Use the owning task address from your own error log.


The output shows which thread references the pointer to the owning task; that is the thread that owns the latch. Switch to the thread that is executing the owning task, then walk its stack to see why it has held the latch for so long.

0:002>   ==>  Print the stack
Call Site
ntdll!ZwWaitForSingleObject
kernel32!WaitForSingleObjectEx
sqlservr!SOS_Scheduler::SwitchContext
sqlservr!SOS_Scheduler::Suspend
sqlservr!SOS_Event::Wait
sqlservr!BPool::FlushCache
sqlservr!checkpoint2
sqlservr!alloca_probe
sqlservr!ProcessCheckpointRequest
sqlservr!CheckpointLoop
sqlservr!ckptproc
sqlservr!SOS_Task::Param::Execute
sqlservr!SOS_Scheduler::RunTask
sqlservr!SOS_Scheduler::ProcessTasks
sqlservr!SchedulerManager::WorkerEntryPoint
sqlservr!SystemThread::RunWorker
sqlservr!SystemThreadDispatcher::ProcessWorker
sqlservr!SchedulerManager::ThreadEntryPoint
msvcr80!endthreadex
msvcr80!endthreadex

From the above stack, we can see that the thread that owns the latch is executing a checkpoint and flushing the cache (dirty buffers) to disk. If flushing buffers to disk during a checkpoint takes a long time, a disk bottleneck is the likely cause.

The same approach applies to any other latch timeout: first identify the thread that owns the latch, read its stack to understand what task it is performing, and then troubleshoot the performance of that task.

If you want to see the stack of the waiting thread instead, pick up the task address (task) from the latch timeout warning in the error log rather than the owning task, and run the same thread search described above.

I hope this post helps you understand and debug latch timeout issues.

In a multithreaded process what would happens when a one thread updates a data or index page in memory while second thread is reading the same page?

What will happen when 1st  thread reads a data/index page in memory while 2nd thread is freeing the same page from memory?

Answer: We would end up with data or data structure inconsistency. To avoid inconsistency SQL Server uses Synchronization Mechanisms like Locks,Latches and Spinlocks.

We will discuss few key points about latches and how to debug latch timeout dumps in this blog.

What is Latch ?

Types of the Latch

Buffer (BUF) Latch

Used to synchronize access to BUF structures and their associated database pages.

Buffer “IO” Latch

A subset of BUF latches used when the BUF and associated data/index page is in the middle of an IO operation (Reading page from disk or writing page to disk).

Non-Buffer (Non-BUF) Latch

These are latches that are used to synchronize general in-memory data structures generally used by queries/tasks executed by parallel threads, auto grow operations , shrink operations etc. 

Keep (KP) Latches

Used to ensure that the page is not released from memory while it is in use. 

Used for read-only access to data structures and prevent write access by others threads.

SH is compatible with KP, SH, and UP.  It should be noted that although in general SH implies read-only access, it is not always the case. For buffer latches SH is the minimum mode required in order to read a data page.

Update (UP) Latches

Allows read access to the data structure(Compatible with SH and KP), but prevents other EX-latch access. 

Used for write operations when torn page detection is off and when AWE is not enabled.

Exclusive (EX) Latches

Prevents any read activity from occurring on the latched structure. EX is only compatible with KP.

Used during read IO during write IO when torn page detection is on or AWE is enabled.

Destroy (DT) Latches

Used when removing BUFs from the buffer pool, either by adding them to the free list or unmapping AWE buffers. 

How do you identify Latch contention?

Latch contention can be identified using below wait types in sysprocesses.

PAGEIOLATCH_*: This waittype in sysprocesses indicates that SQL Server is waiting on a physical I/O of a buffer pool page to complete. 

                                            1. PAGEIOLATCH_* are commonly solved by tuning the queries which are performing heavy IO (Commonly by adding, changing and removing indexes (or) statistics to reduce the amount of physical IO).

                                 2. Identifying if there is disk bottleneck and fixing them (Pageiolatch wait times (ex > 30 ms))

PAGELATCH_*: This waittype in sysprocesses indicates that SQL Server is waiting on access to a database page, but the page is not undergoing physical IO. 

1.       This problem is normally caused by a large number of sessions attempting to access the same physical page at the same time. We should Look at the wait resource of the spid. The wait_resource is the page number (the format is  dbid:file:pageno)

          that is being accessed. 

2.       We can use DBCC PAGE to identify object or type of the page in which we have the contention. Also it will help us to determine  whether contention  is for allocation, data or text.

3.       If the pages that SQL Server is most frequently waiting on are in tempdb database ,check the wait resource column for a page number in dbid 2. You may be facing tempdb allocation latch contention mentioned in

          clustered index key to spread the work across different pages.

LATCH_*:    Non-buf latch waits can be caused by variety of things.  We can use the wait resource column in sysprocesses to determine the type of latch involved(KB 822101). 

2.       Auto Grow and auto shrink.

When a latch is requested by thread and If  that latch cannot be granted immediately because of some other thread holding a incompatible latch on same page or data structure then  the requestor must wait for the latch to be grantable.  Warning messages like one below is printed in SQL Server error log and a mini dump with all the threads is captures if the wait interval reaches 5 minutes (). The warning message differs for buffer and non-buffer latches.

844: Time out occurred while waiting for buffer latch — type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p.  Continuing to wait.

846: A time-out occurred while waiting for buffer latch — type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit Id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Not continuing to wait.

847: Timeout occurred while waiting for latch: class ‘%ls’, id %p, type %d, Task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Continuing to wait.

Break up of above warning

Task for which we are trying to acquire latch.

The total time waited for this latch acquire request in seconds.

The address of the Task that owns the latch, if available.

bp (Buffer latches only)

The address of the BUF structure corresponding to this buffer latch.

page (Buffer latches only.)

The page id for the page currently contained in the BUF structure.

database id (Buffer latches only.)

The database id for the page in the BUF.

When there is latch timeout dump you will see a warning message similar to one below. Warning error message printed in SQL server errorlog before the dump is very important to find the owner thread of latch.

2012-01-18 00:52:03.16 spid69      A time-out occurred while waiting for buffer latch — type 4, bp 00000000ECFDAA00, page 1:6088, stat 0x4c1010f, database id: 4, allocation unit Id: 72057594043367424, task 0x0000000006E096D8 : 0, waittime 300, flags 0x19,

owning task 0x0000000006E08328. Not continuing to wait.

spid21s     **Dump thread – spid = 21, PSS = 0x0000000094622B60, EC = 0x0000000094622B70

spid21s     ***Stack Dump being sent to E:\Data\Disk1\MSSQL.1\MSSQL\LOG\SQLDump0009.txt

spid21s     * *******************************************************************************

spid21s     * BEGIN STACK DUMP:

spid21s     *   02/28/12 00:32:03 spid 21

spid21s     * Latch timeout

Timeout occurred while waiting for latch: class ‘ACCESS_METHODS_HOBT_COUNT’, id 00000002D8C32E70, type 2, Task 0x00000000008FCBC8 : 7, waittime 300, flags 0x1a, owning task 0x00000000050E1288. Continuing to wait.

Timeout occurred while waiting for latch: class ‘ACCESS_METHODS_HOBT_VIRTUAL_ROOT’, id 00000002D8C32E70, type 2, Task 0x00000000008FCBC8 : 7, waittime 300, flags 0x1a, owning task 0x00000000050E1288. Continuing to wait.

From the error message above we can easily understand we are trying to acquire latch on database id: 4, page 1:6088 (6088 page of first file) and we timed out because task 0x0000000006E08328 (owning task 0x0000000006E08328 in warning message)  is holding a latch on it.

: Task is simply a work request to be performed by the thread. (such as system tasks, login task, Ghost cleanup task etc.). Threads which execute the task will take required latches on need.

Let us see how to analyze latch timeout dump and get the owning thread of the Latch using the  owning task .

To analyze the dump download and Install Windows Debugger from

Open Windbg .  Choose File menu –> select Open crash dump –>Select the Dump file (SQLDump000#.mdmp)

on command window type
.sympath srv*c:\Websymbols*http://msdl.microsoft.com/download/symbols;

Type and hit enter. This will force debugger to immediately load all the symbols.

Verify if symbols are loaded for  SQL Server by using the debugger command lmvm

0:002> lmvm sqlservr
start             end                 module name
00000000`01000000 00000000`03679000   sqlservr T (pdb symbols)          c:\websymbols\sqlservr.pdb\21E4AC6E96294A529C9D99826B5A7C032\sqlservr.pdb
    Loaded symbol image file: sqlservr.exe
    Image path: C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Binn\sqlservr.exe
    Image name: sqlservr.exe
    Timestamp:        Wed Oct 07 21:15:52 2009 (4ACD6778)
    CheckSum:         025FEB5E
    ImageSize:        02679000
    File version:     2005.90.4266.0
    Product version:  9.0.4266.0
    File flags:       0 (Mask 3F)
    File OS:          40000 NT Base
    File type:        1.0 App
    File date:        00000000.00000000
    Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4

Use the below command to search thread stack to identify the thread which has reference to the owning task and it will be the thread which is owning the latch. Replace with owning task in your errorlog

From the above out put we see thread reference to the pointer of owning task and it will be the thread which is owning the latch. Let us switch to the thread() which is executing the owning task and

then go through the stack to see why the thread is owning the latch for long time.

0:002>   ==>  Print the stack
Call Site
ntdll!ZwWaitForSingleObject
kernel32!WaitForSingleObjectEx
sqlservr!SOS_Scheduler::SwitchContext
sqlservr!SOS_Scheduler::Suspend
sqlservr!SOS_Event::Wait
sqlservr!BPool::FlushCache
sqlservr!checkpoint2
sqlservr!alloca_probe
sqlservr!ProcessCheckpointRequest
sqlservr!CheckpointLoop
sqlservr!ckptproc
sqlservr!SOS_Task::Param
::Execute
sqlservr!SOS_Scheduler::RunTask
sqlservr!SOS_Scheduler::ProcessTasks
sqlservr!SchedulerManager::WorkerEntryPoint
sqlservr!SystemThread::RunWorker
sqlservr!SystemThreadDispatcher::ProcessWorker
sqlservr!SchedulerManager::ThreadEntryPoint
msvcr80!endthreadex
msvcr80!endthreadex

From the above stack we can understand that the thread which is owning the latch is executing checkpoint and flushing cache (Dirty buffers) to disk. If flushing buffers to disk (checkpoint) is taking a long time, then obviously there is disk bottleneck.

Similarly for any other latch time out issues first identify the owner thread of latch, read the stack of owner thread to understand the task performed by owner thread and troubleshoot the performance of task performed by owner thread.

If you want to see the stack of thread which is waiting, then pickup the task (task )from latch timeout warning message in errorlog instead of owning task (task ) and use the command mentioned in step 5.

I hope this post will help you to learn and debug the latch timeout issues.

In a multithreaded process what would happens when a one thread updates a data or index page in memory while second thread is reading the same page?

What will happen when 1st  thread reads a data/index page in memory while 2nd thread is freeing the same page from memory?

Answer: We would end up with data or data structure inconsistency. To avoid inconsistency SQL Server uses Synchronization Mechanisms like Locks,Latches and Spinlocks.

We will discuss few key points about latches and how to debug latch timeout dumps in this blog.

What is Latch ?

Types of the Latch

Buffer (BUF) Latch

Used to synchronize access to BUF structures and their associated database pages.

Buffer “IO” Latch

A subset of BUF latches used when the BUF and associated data/index page is in the middle of an IO operation (Reading page from disk or writing page to disk).

Non-Buffer (Non-BUF) Latch

These are latches that are used to synchronize general in-memory data structures generally used by queries/tasks executed by parallel threads, auto grow operations , shrink operations etc. 

Keep (KP) Latches

Used to ensure that the page is not released from memory while it is in use. 

Used for read-only access to data structures and prevent write access by others threads.

SH is compatible with KP, SH, and UP.  It should be noted that although in general SH implies read-only access, it is not always the case. For buffer latches SH is the minimum mode required in order to read a data page.

Update (UP) Latches

Allows read access to the data structure(Compatible with SH and KP), but prevents other EX-latch access. 

Used for write operations when torn page detection is off and when AWE is not enabled.

Exclusive (EX) Latches

Prevents any read activity from occurring on the latched structure. EX is only compatible with KP.

Used during read IO during write IO when torn page detection is on or AWE is enabled.

Destroy (DT) Latches

Used when removing BUFs from the buffer pool, either by adding them to the free list or unmapping AWE buffers. 

How do you identify Latch contention?

Latch contention can be identified using below wait types in sysprocesses.

PAGEIOLATCH_*: This waittype in sysprocesses indicates that SQL Server is waiting on a physical I/O of a buffer pool page to complete. 

                                            1. PAGEIOLATCH_* are commonly solved by tuning the queries which are performing heavy IO (Commonly by adding, changing and removing indexes (or) statistics to reduce the amount of physical IO).

                                 2. Identifying if there is disk bottleneck and fixing them (Pageiolatch wait times (ex > 30 ms))

PAGELATCH_*: This waittype in sysprocesses indicates that SQL Server is waiting on access to a database page, but the page is not undergoing physical IO. 

1.       This problem is normally caused by a large number of sessions attempting to access the same physical page at the same time. We should Look at the wait resource of the spid. The wait_resource is the page number (the format is  dbid:file:pageno)

Дополнительно:  Произошла ошибка связанная с работой видеокарты WoT - как решить?

          that is being accessed. 

2.       We can use DBCC PAGE to identify object or type of the page in which we have the contention. Also it will help us to determine  whether contention  is for allocation, data or text.

3.       If the pages that SQL Server is most frequently waiting on are in tempdb database ,check the wait resource column for a page number in dbid 2. You may be facing tempdb allocation latch contention mentioned in

          clustered index key to spread the work across different pages.

LATCH_*:    Non-buf latch waits can be caused by variety of things.  We can use the wait resource column in sysprocesses to determine the type of latch involved(KB 822101). 

2.       Auto Grow and auto shrink.

When a latch is requested by thread and If  that latch cannot be granted immediately because of some other thread holding a incompatible latch on same page or data structure then  the requestor must wait for the latch to be grantable.  Warning messages like one below is printed in SQL Server error log and a mini dump with all the threads is captures if the wait interval reaches 5 minutes (). The warning message differs for buffer and non-buffer latches.

844: Time out occurred while waiting for buffer latch -- type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p.  Continuing to wait.

846: A time-out occurred while waiting for buffer latch -- type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit Id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Not continuing to wait.

847: Timeout occurred while waiting for latch: class '%ls', id %p, type %d, Task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Continuing to wait.

Breakdown of the fields in the warnings above:

task: The task for which we are trying to acquire the latch.

waittime: The total time waited for this latch acquire request, in seconds.

owning task: The address of the task that owns the latch, if available.

bp (buffer latches only): The address of the BUF structure corresponding to this buffer latch.

page (buffer latches only): The page id of the page currently contained in the BUF structure.

database id (buffer latches only): The database id of the page in the BUF.

When there is a latch timeout dump, you will see a warning message similar to the one below. The warning printed in the SQL Server error log just before the dump is essential for finding the thread that owns the latch.

2012-01-18 00:52:03.16 spid69      A time-out occurred while waiting for buffer latch -- type 4, bp 00000000ECFDAA00, page 1:6088, stat 0x4c1010f, database id: 4, allocation unit Id: 72057594043367424, task 0x0000000006E096D8 : 0, waittime 300, flags 0x19, owning task 0x0000000006E08328. Not continuing to wait.

spid21s     **Dump thread – spid = 21, PSS = 0x0000000094622B60, EC = 0x0000000094622B70

spid21s     ***Stack Dump being sent to E:\Data\Disk1\MSSQL.1\MSSQL\LOG\SQLDump0009.txt

spid21s     * *******************************************************************************

spid21s     * BEGIN STACK DUMP:

spid21s     *   02/28/12 00:32:03 spid 21

spid21s     * Latch timeout

Timeout occurred while waiting for latch: class ‘ACCESS_METHODS_HOBT_COUNT’, id 00000002D8C32E70, type 2, Task 0x00000000008FCBC8 : 7, waittime 300, flags 0x1a, owning task 0x00000000050E1288. Continuing to wait.

Timeout occurred while waiting for latch: class ‘ACCESS_METHODS_HOBT_VIRTUAL_ROOT’, id 00000002D8C32E70, type 2, Task 0x00000000008FCBC8 : 7, waittime 300, flags 0x1a, owning task 0x00000000050E1288. Continuing to wait.

From the error message above we can see that the request was for a latch on database id 4, page 1:6088 (page 6088 of file 1), and that it timed out because task 0x0000000006E08328 (the owning task in the warning message) is holding a latch on that page.
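Reading these fields by eye gets tedious when an error log holds many timeouts. The sketch below is a hedged illustration, not part of the original post: it pulls the same fields out of a warning line with regular expressions. The function name `parse_latch_timeout` and the dictionary keys are made up for this example, and the patterns assume the message layouts shown above.

```python
import re

# Sample line taken from the error log excerpt in this article (timestamp omitted).
LOG_LINE = (
    "A time-out occurred while waiting for buffer latch -- type 4, "
    "bp 00000000ECFDAA00, page 1:6088, stat 0x4c1010f, database id: 4, "
    "allocation unit Id: 72057594043367424, task 0x0000000006E096D8 : 0, "
    "waittime 300, flags 0x19, owning task 0x0000000006E08328. "
    "Not continuing to wait."
)


def parse_latch_timeout(line):
    """Extract triage fields from a latch timeout warning (hypothetical helper).

    Keys that do not appear in the message (for example the page of a
    non-buffer latch warning) are simply absent from the result.
    """
    fields = {}

    page = re.search(r"\bpage (\d+):(\d+)", line)
    if page:
        fields["file_id"] = int(page.group(1))
        fields["page_id"] = int(page.group(2))

    dbid = re.search(r"database id: (\d+)", line)
    if dbid:
        fields["database_id"] = int(dbid.group(1))

    wait = re.search(r"waittime (\d+)", line)
    if wait:
        fields["waittime_s"] = int(wait.group(1))

    # The waiter is printed as "task 0x... : n", the holder as "owning task 0x...".
    waiter = re.search(r"(?<!owning )task (0x[0-9A-Fa-f]+) :", line, re.IGNORECASE)
    if waiter:
        fields["waiting_task"] = waiter.group(1)

    owner = re.search(r"owning task (0x[0-9A-Fa-f]+)", line)
    if owner:
        fields["owning_task"] = owner.group(1)

    return fields


info = parse_latch_timeout(LOG_LINE)
print(info)
```

The `owning_task` value is the address to search for in the dump; `waiting_task` is the one to use when the stack of the waiting thread is of interest instead.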

Note: A task is simply a work request to be performed by a thread (system tasks, login tasks, the ghost cleanup task, and so on). The thread that executes the task takes the latches it needs.

Let us see how to analyze latch timeout dump and get the owning thread of the Latch using the  owning task .

To analyze the dump download and Install Windows Debugger from

Open WinDbg. From the File menu, select Open Crash Dump and choose the dump file (SQLDump000#.mdmp).

In the command window, type:
.sympath srv*c:\Websymbols*http://msdl.microsoft.com/download/symbols;

Type and hit enter. This will force debugger to immediately load all the symbols.

Verify if symbols are loaded for  SQL Server by using the debugger command lmvm

0:002> lmvm sqlservr
start             end                 module name
00000000`01000000 00000000`03679000   sqlservr T (pdb symbols)          c:\websymbols\sqlservr.pdb\21E4AC6E96294A529C9D99826B5A7C032\sqlservr.pdb
    Loaded symbol image file: sqlservr.exe
    Image path: C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Binn\sqlservr.exe
    Image name: sqlservr.exe
    Timestamp:        Wed Oct 07 21:15:52 2009 (4ACD6778)
    CheckSum:         025FEB5E
    ImageSize:        02679000
    File version:     2005.90.4266.0
    Product version:  9.0.4266.0
    File flags:       0 (Mask 3F)
    File OS:          40000 NT Base
    File type:        1.0 App
    File date:        00000000.00000000
    Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4

Use the below command to search thread stack to identify the thread which has reference to the owning task and it will be the thread which is owning the latch. Replace with owning task in your errorlog

From the above out put we see thread reference to the pointer of owning task and it will be the thread which is owning the latch. Let us switch to the thread() which is executing the owning task and

then go through the stack to see why the thread is owning the latch for long time.

0:002>   ==>  Print the stack
Call Site
ntdll!ZwWaitForSingleObject
kernel32!WaitForSingleObjectEx
sqlservr!SOS_Scheduler::SwitchContext
sqlservr!SOS_Scheduler::Suspend
sqlservr!SOS_Event::Wait
sqlservr!BPool::FlushCache
sqlservr!checkpoint2
sqlservr!alloca_probe
sqlservr!ProcessCheckpointRequest
sqlservr!CheckpointLoop
sqlservr!ckptproc
sqlservr!SOS_Task::Param::Execute
sqlservr!SOS_Scheduler::RunTask
sqlservr!SOS_Scheduler::ProcessTasks
sqlservr!SchedulerManager::WorkerEntryPoint
sqlservr!SystemThread::RunWorker
sqlservr!SystemThreadDispatcher::ProcessWorker
sqlservr!SchedulerManager::ThreadEntryPoint
msvcr80!endthreadex
msvcr80!endthreadex

From the stack above we can see that the thread owning the latch is running a checkpoint and flushing the cache (dirty buffers) to disk. If flushing buffers to disk takes this long, a disk bottleneck is the most likely cause.

Similarly, for any other latch timeout, first identify the thread that owns the latch, read its stack to understand what task it is performing, and then troubleshoot the performance of that task.

If you want to see the stack of thread which is waiting, then pickup the task (task )from latch timeout warning message in errorlog instead of owning task (task ) and use the command mentioned in step 5.

I hope this post will help you to learn and debug the latch timeout issues.

SQL Server Optimizations for High Concurrency

Our business needs a very robust, low-latency, highly available, and durable online transactional system that carries its peak load for about four weeks a year. It is almost like a Thanksgiving sale where you mark down a very popular item (think of an iPhone) by 100%. We operate in the state-level K-12 online assessment market, where an entire state takes tests over a few weeks in spring. To operate in this market, you need a robust system that can absorb such a sudden spike in transaction volume.

:
We use SQL Server 2016 on a Windows Server Failover Cluster. Our storage layer includes RAID 10 SAN storage and local SSDs for TempDB.


During initial load testing, we noticed a high number of LATCH waits. While investigating the root cause, we found that a few tables were being accessed by a large number of concurrent sessions, which created the waits. Most of our transactions are inserts and updates of very small data sets. We also encountered last-page insert contention.

Latches are internal to the SQL Server engine and are used to provide memory consistency, whereas locks are used by SQL Server to provide logical transactional consistency.

A PAGELATCH is used to synchronize short-term access to database pages that reside in the buffer cache, as opposed to a PAGEIOLATCH, which is used to synchronize physical access to pages on disk. These waits are normal in every system; the problem arises when there is contention. In our use case, many concurrent sessions access a single page, which causes waits and hinders the ability to perform these inserts and updates efficiently.

A page in SQL Server is 8KB and can store multiple rows. To increase concurrency and performance, latches are held only for the duration of the physical operation on the page, unlike locks, which are held for the duration of the logical transaction.

From the root cause analysis, it became clear that we had latch contention on a few tables, which needed to be alleviated to improve throughput.

Latch Waits Noticed

From the screenshot above, you can see very high latch contention on page 5261488. You can turn on trace flag 3604 and use DBCC PAGE to inspect the page contents.

Latch Waits and PFS Page or Not

TempDB Allocation Page Contention

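-- Trace flag 3604 sends DBCC output to the client session
DBCC TRACEON(3604);
-- Table that receives the tabular DBCC PAGE output
CREATE TABLE PageResults (ParentObject sysname, [Object] sysname, Field sysname, [VALUE] nvarchar(MAX));
-- DBCC PAGE arguments: database id, file id, page id, print option (3 = full detail)
INSERT INTO PageResults (ParentObject, [Object], Field, [VALUE])
EXEC ('DBCC PAGE(67,1,3987384,3) WITH TABLERESULTS');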

Root Cause Identification:

Clear Wait Stats and Plan Cache to Initialize

-- Remove all elements from the plan cache for one database
DECLARE @intDBID INT;
SET @intDBID = ( SELECT [dbid]
FROM   master.dbo.sysdatabases
WHERE  name = 'LoadTestDB'
);
-- Flush the procedure cache for one database only
IF @intDBID IS NOT NULL
BEGIN
DBCC FLUSHPROCINDB (@intDBID);
END;
GO
-- Reset wait and latch statistics.
DBCC SQLPERF('sys.dm_os_latch_stats' , CLEAR)
DBCC SQLPERF('sys.dm_os_wait_stats' , CLEAR)

Wait Stats Query:

Declare @ExcludedWaits Table (WaitType sysname not null primary key)
-- Waits that can be ignored
Insert Into @ExcludedWaits
Values ('CLR_SEMAPHORE'),
('SQLTRACE_BUFFER_FLUSH'),
('WAITFOR'),
('REQUEST_FOR_DEADLOCK_SEARCH'),
('XE_TIMER_EVENT'),
('BROKER_TO_FLUSH'),
('BROKER_TASK_STOP'),
('CLR_MANUAL_EVENT'),
('CLR_AUTO_EVENT'),
('FT_IFTS_SCHEDULER_IDLE_WAIT'),
('XE_DISPATCHER_WAIT'),
('XE_DISPATCHER_JOIN'),
('BROKER_RECEIVE_WAITFOR');
Select SessionID = WT.session_id,
WaitDuration_ms = WT.wait_duration_ms,
WaitType = WT.wait_type,
WaitResource = WT.resource_description,
Program = S.program_name,
QueryPlan = CP.query_plan,
SQLText = SUBSTRING(ST.text, (R.statement_start_offset/2)+1,
((Case R.statement_end_offset
When -1 Then DATALENGTH(ST.text)
Else R.statement_end_offset
End - R.statement_start_offset)/2) + 1),
DBName = DB_NAME(R.database_id),
BlockingSessionID = WT.blocking_session_id,
BlockerQueryPlan = CPBlocker.query_plan,
BlockerSQLText = SUBSTRING(STBlocker.text, (RBlocker.statement_start_offset/2)+1,
((Case RBlocker.statement_end_offset
When -1 Then DATALENGTH(STBlocker.text)
Else RBlocker.statement_end_offset
End - RBlocker.statement_start_offset)/2) + 1)
From sys.dm_os_waiting_tasks WT
Inner Join sys.dm_exec_sessions S on WT.session_id = S.session_id
Inner Join sys.dm_exec_requests R on R.session_id = WT.session_id
Outer Apply sys.dm_exec_query_plan (R.plan_handle) CP
Outer Apply sys.dm_exec_sql_text(R.sql_handle) ST
Left Join sys.dm_exec_requests RBlocker on RBlocker.session_id = WT.blocking_session_id
Outer Apply sys.dm_exec_query_plan (RBlocker.plan_handle) CPBlocker
Outer Apply sys.dm_exec_sql_text(RBlocker.sql_handle) STBlocker
Where R.status = 'suspended' -- Waiting on a resource
And S.is_user_process = 1 -- Is a user process
And R.session_id <> @@spid -- Filter out this session
And WT.wait_type Not Like '%sleep%' -- more waits to ignore
And WT.wait_type Not Like '%queue%' -- more waits to ignore
And WT.wait_type Not Like -- more waits to ignore
Case When SERVERPROPERTY('IsHadrEnabled') = 0 Then 'HADR%'
Else 'zzzz' End
And Not Exists (Select 1 From @ExcludedWaits
Where WaitType = WT.wait_type)
ORDER BY WaitDuration_ms DESC
Option(Recompile); -- Don't save query plan in plan cache

Common Latch Waits

;
WITH [Latches] AS
(SELECT
[latch_class],
[wait_time_ms] / 1000.0 AS [WaitS],
[waiting_requests_count] AS [WaitCount],
100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage],
ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum]
FROM sys.dm_os_latch_stats
WHERE [latch_class] NOT IN (
N'BUFFER')
AND [wait_time_ms] > 0
)
SELECT
MAX ([W1].[latch_class]) AS [LatchClass],
CAST (MAX ([W1].[WaitS]) AS DECIMAL(14, 2)) AS [Wait_S],
MAX ([W1].[WaitCount]) AS [WaitCount],
CAST (MAX ([W1].[Percentage]) AS DECIMAL(14, 2)) AS [Percentage],
CAST ((MAX ([W1].[WaitS]) / MAX ([W1].[WaitCount])) AS DECIMAL (14, 4)) AS [AvgWait_S]
FROM [Latches] AS [W1]
INNER JOIN [Latches] AS [W2]
ON [W2].[RowNum] <= [W1].[RowNum]
GROUP BY [W1].[RowNum]
HAVING SUM ([W2].[Percentage]) - MAX ([W1].[Percentage]) < 95; -- percentage threshold
GO

Latch Waits Resolution:

  1. In-Memory OLTP
  2. Replacing the identity integer column with a GUID as the leading index column
  3. Hash partitioning with a computed column

Of these three options, In-Memory OLTP looked the most promising, but we did not have enough time to implement an In-Memory OLTP migration, so we adopted the second option. We replaced the identity integer columns with GUIDs. This increased page splits and index fragmentation, but because our workload is insert heavy we accepted that cost. It is not our preferred resolution; it is a trade-off we made based on the available time and resources.

In the future, we plan to migrate these hot tables to In-Memory OLTP. Just by replacing INTs with GUIDs, we saw roughly a 20-30x performance improvement.
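To make the trade-off concrete, here is a small simulation, offered as a rough sketch rather than a model of the real storage engine. It treats an index as sorted leaf pages of a fixed size and counts how many distinct pages a batch of concurrent inserts would land on. `ROWS_PER_PAGE`, `EXISTING_ROWS`, the batch size, and the helper names are all assumptions for illustration, and page splits are ignored.

```python
import bisect
import random

ROWS_PER_PAGE = 100     # hypothetical number of index rows per 8KB leaf page
EXISTING_ROWS = 10_000  # rows already in the index


def leaf_page_for(key, page_low_keys):
    """Return the index of the leaf page whose key range contains `key`."""
    return max(bisect.bisect_right(page_low_keys, key) - 1, 0)


def pages_touched(new_keys, page_low_keys):
    """Count the distinct leaf pages that a batch of inserts lands on."""
    return len({leaf_page_for(k, page_low_keys) for k in new_keys})


rng = random.Random(42)
batch = 64  # concurrent sessions, each inserting one row

# Case 1: ever-increasing integer key (identity style).
int_keys = list(range(1, EXISTING_ROWS + 1))
int_low_keys = int_keys[::ROWS_PER_PAGE]  # first key on each leaf page
new_int_keys = range(EXISTING_ROWS + 1, EXISTING_ROWS + 1 + batch)
sequential_pages = pages_touched(new_int_keys, int_low_keys)

# Case 2: random 128-bit key (GUID style).
guid_keys = sorted(rng.getrandbits(128) for _ in range(EXISTING_ROWS))
guid_low_keys = guid_keys[::ROWS_PER_PAGE]
new_guid_keys = [rng.getrandbits(128) for _ in range(batch)]
random_pages = pages_touched(new_guid_keys, guid_low_keys)

print("pages hit with sequential keys:", sequential_pages)
print("pages hit with random keys:    ", random_pages)
```

With sequential keys every insert targets the last page, which is the last-page insert contention described above. Random keys spread the same inserts over many pages, at the price of the page splits and fragmentation already mentioned.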

TempDB Allocation Page Contention  

SELECT  WT.session_id ,
        WT.wait_type ,
        WT.wait_duration_ms ,
        WT.blocking_session_id ,
        WT.resource_description ,
        -- PFS pages: page 1, then every 8088th page.
        -- GAM pages: page 2, then every 511232nd page; an SGAM page follows each GAM page.
        ResourceType = CASE WHEN P.page_id = 1 OR P.page_id % 8088 = 0
                                THEN 'Is PFS Page'
                            WHEN P.page_id = 2 OR P.page_id % 511232 = 0
                                THEN 'Is GAM Page'
                            WHEN P.page_id = 3 OR ( P.page_id - 1 ) % 511232 = 0
                                THEN 'Is SGAM Page'
                            ELSE 'Is Not PFS, GAM, or SGAM page'
                       END
FROM    sys.dm_os_waiting_tasks AS WT
        -- resource_description is dbid:fileid:pageid; keep the part after the second colon
        CROSS APPLY ( SELECT page_id = CAST(RIGHT(WT.resource_description,
                                                   LEN(WT.resource_description)
                                                   - CHARINDEX(':', WT.resource_description, 3)) AS INT) ) AS P
WHERE   WT.wait_type LIKE 'PAGE%LATCH_%'
        AND WT.resource_description LIKE '2:%'
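The page arithmetic behind that classification is easy to get wrong, so here is the same check as a small standalone sketch. It assumes the commonly documented layout: PFS pages at page 1 and then every 8088 pages, GAM pages at page 2 and then every 511232 pages, and SGAM pages at page 3 and then one page after each later GAM page. The function names are hypothetical, and the input is assumed to be a plain dbid:fileid:pageid string.

```python
PFS_INTERVAL = 8088     # pages tracked by one PFS page
GAM_INTERVAL = 511232   # pages tracked by one GAM (and one SGAM) page


def parse_resource(resource_description):
    """Split 'dbid:fileid:pageid' into three integers."""
    db_id, file_id, page_id = (int(part) for part in resource_description.split(":"))
    return db_id, file_id, page_id


def page_type(page_id):
    """Name the allocation page type for a page id, or 'other'."""
    if page_id == 0:
        return "other"  # page 0 is the file header, not an allocation bitmap
    if page_id == 1 or page_id % PFS_INTERVAL == 0:
        return "PFS"
    if page_id == 2 or page_id % GAM_INTERVAL == 0:
        return "GAM"
    if page_id == 3 or (page_id - 1) % GAM_INTERVAL == 0:
        return "SGAM"
    return "other"


for resource in ["2:1:1", "2:1:2", "2:1:3", "2:1:8088", "2:3:511233", "2:1:120"]:
    db_id, file_id, page_id = parse_resource(resource)
    print(resource, "->", page_type(page_id))
```

If most waits in dbid 2 classify as PFS, GAM, or SGAM, the contention is on allocation pages. If they classify as "other", it is on ordinary data or index pages, and adding tempdb files is unlikely to help.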

TempDB PFS Page Contention  Noticed

TempDB Non Allocation Page Contention

PFS Page Contention Alleviation:

ALTER DATABASE [tempdb] ADD FILE (NAME = N'temp9',FILENAME = N'T:\TempDB\tempdb_mssql_9.ndf',SIZE = 2097152KB,FILEGROWTH =1048576KB )
GO
ALTER DATABASE [tempdb] ADD FILE (NAME = N'temp10',FILENAME = N'T:\TempDB\tempdb_mssql_10.ndf',SIZE = 2097152KB,FILEGROWTH =1048576KB )
GO
ALTER DATABASE [tempdb] ADD FILE (NAME = N'temp11',FILENAME = N'T:\TempDB\tempdb_mssql_11.ndf',SIZE = 2097152KB,FILEGROWTH =1048576KB )
GO
ALTER DATABASE [tempdb] ADD FILE (NAME = N'temp12',FILENAME = N'T:\TempDB\tempdb_mssql_12.ndf',SIZE = 2097152KB,FILEGROWTH =1048576KB )
GO
ALTER DATABASE [tempdb] ADD FILE (NAME = N'temp13',FILENAME = N'T:\TempDB\tempdb_mssql_13.ndf',SIZE = 2097152KB,FILEGROWTH =1048576KB )
GO
ALTER DATABASE [tempdb] ADD FILE (NAME = N'temp14',FILENAME = N'T:\TempDB\tempdb_mssql_14.ndf',SIZE = 2097152KB,FILEGROWTH =1048576KB )
GO
ALTER DATABASE [tempdb] ADD FILE (NAME = N'temp15',FILENAME = N'T:\TempDB\tempdb_mssql_15.ndf',SIZE = 2097152KB,FILEGROWTH =1048576KB )
GO
ALTER DATABASE [tempdb] ADD FILE (NAME = N'temp16',FILENAME = N'T:\TempDB\tempdb_mssql_16.ndf',SIZE = 2097152KB,FILEGROWTH =1048576KB )
GO

ThreadPool Waits Observed

We also encountered THREADPOOL waits because we had a few thousand transactions coming in per second, so we increased the max worker threads setting. Exercise extreme caution when changing this configuration, because it can have unwanted consequences.

Max Number of Worker Threads Defaults

EXEC sp_configure 'max worker threads', 1472 ;  
GO  
RECONFIGURE;  
GO
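Before raising the setting, it helps to know what the default (a configured value of 0) resolves to on a given machine. The sketch below encodes the formula as I understand it from Microsoft's documentation for 64-bit instances. The doubled multiplier above 64 logical CPUs is assumed to apply from SQL Server 2016 SP2 onward, so treat that branch with care on earlier builds. The function name is hypothetical.

```python
def default_max_worker_threads(logical_cpus):
    """Approximate default 'max worker threads' on a 64-bit instance.

    Assumption: the > 64 CPU branch reflects SQL Server 2016 SP2 and later.
    """
    if logical_cpus <= 4:
        return 512
    if logical_cpus <= 64:
        return 512 + (logical_cpus - 4) * 16
    return 512 + (logical_cpus - 4) * 32


for cpus in (4, 8, 16, 32, 64):
    print(cpus, "logical CPUs ->", default_max_worker_threads(cpus))
```

Comparing the configured value against this baseline shows how far a change departs from what the server would choose on its own.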

ACCESS_METHODS_HOBT_VIRTUAL_ROOT Waits Noticed

This latch class is when a thread is waiting for access to the in-memory metadata entry containing a B-tree’s root page ID. EX access is required to change the root page ID, which typically happens when a B-tree becomes a level deeper during heavy insertions into a new index and the existing root page has to split. Every B-tree traversal has to start at the root page, which requires obtaining this latch in SH mode.

Max Degree of Parallelism and Cost Threshold

EXEC sp_configure 'max degree of parallelism','12'
RECONFIGURE

EXEC sp_configure 'cost threshold for parallelism','30'
RECONFIGURE

Recommended Values by Microsoft

Lock Pages in Memory:

We also enabled Lock Pages in Memory.

Lock Pages in Memory

Trace Flag 1118

In SQL Server 2016, this behavior is on by default. Trace flag 1118 switches allocations in tempdb from one page at a time for the first 8 pages to immediately allocating a full extent (8 pages). It is used to help alleviate allocation bitmap contention in tempdb under a heavy load of small temp table creation and deletion. In previous versions, you need to turn on this trace flag explicitly.

We learned a lot during this exercise. Supporting high concurrency (a few thousand transactions per second) requires a very robust system, and the majority of the standard best practices do not apply in these scenarios. We ran into latch waits, THREADPOOL waits, TempDB allocation contention, and WRITELOG waits, and solved one problem at a time after carefully considering the trade-offs. It took several weeks of coordinated effort from several teams to identify the bottlenecks in the system and alleviate them.

Six, lock manager waits (LCK_M_**)

  • LCK_M_IS: occurs when a task is waiting for an intent shared (IS) lock
  • LCK_M_U: occurs when a task is waiting for an update (U) lock
  
(Wait statistics script: wait types that are almost never a problem are filtered out to avoid skewing the results. The mirroring-related and AG-related filters can be commented out when diagnosing mirroring or availability group issues.)

Wait statistics, or please tell me where it hurts

Causes of IO_COMPLETION and WRITE_COMPLETION SQL Server wait types

SQL SERVER – IO_COMPLETION – Wait Type – Day 10 of 28

Troubleshooting SQL Server RESOURCE_SEMAPHORE Waittype Memory Issues

First, view wait information

This article describes common wait types and why they occur. Waits are usually viewed through DMVs:

  • sys.dm_exec_requests: shows the requests that the system is currently processing
  • sys.dm_os_wait_stats: shows aggregated wait statistics for the system
  • sys.dm_os_waiting_tasks: shows the tasks that are currently in a waiting state

1, wait statistics

SQL Server accumulates wait information from the time the server last started or the statistics were last cleared. Wait types generally fall into three categories: resource waits, queue waits, and external waits. In daily use, system wait types are usually filtered out, because they are of little use for diagnosing performance bottlenecks, as are wait types whose wait time is 0; see the wait statistics script included with this article.
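The filtering described here is simple enough to sketch outside the database. The example below is illustrative only: the sample rows and their numbers are invented, the ignore list is deliberately short and incomplete, and `top_waits` is a hypothetical helper. It drops ignorable wait types and zero-time rows, then ranks what is left by share of total wait time.

```python
# Hypothetical sample rows (wait_type, wait_time_ms), shaped like rows read
# from sys.dm_os_wait_stats. The numbers are made up for illustration.
SAMPLE = [
    ("PAGELATCH_EX", 60_000),
    ("WRITELOG", 30_000),
    ("LAZYWRITER_SLEEP", 900_000),
    ("XE_TIMER_EVENT", 450_000),
    ("RESOURCE_SEMAPHORE", 10_000),
    ("THREADPOOL", 0),
]

# A short, incomplete ignore list (an assumption; real scripts list many more).
BENIGN = {"LAZYWRITER_SLEEP", "XE_TIMER_EVENT", "WAITFOR", "BROKER_TO_FLUSH"}


def top_waits(rows, benign=BENIGN):
    """Drop benign and zero-time waits, then rank by percentage of wait time."""
    kept = [(w, ms) for w, ms in rows if w not in benign and ms > 0]
    total = sum(ms for _, ms in kept)
    ranked = sorted(kept, key=lambda row: row[1], reverse=True)
    return [(w, ms, round(100.0 * ms / total, 2)) for w, ms in ranked]


for wait_type, ms, pct in top_waits(SAMPLE):
    print(f"{wait_type:<20} {ms:>8} ms {pct:>6}%")
```

Without the filter, the two idle waits in the sample would account for most of the total and hide the waits that actually point at a bottleneck.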

2, clear wait statistics

In a production environment, if you need to collect wait information, it is best to clear the existing wait statistics first and then restart the count. This is usually done with the DBCC SQLPERF command:

DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR)

Second, Resource Semaphore

The RESOURCE_SEMAPHORE wait type indicates that a worker is waiting for SQL Server to grant the memory it requested to perform hash and sort operations.

1, RESOURCE_SEMAPHORE reveals memory pressure

A RESOURCE_SEMAPHORE wait means that a query's memory request has not been satisfied: the query needs a certain amount of memory before its task can execute, and if SQL Server does not currently have enough, the request cannot be granted and the query waits for memory. In the SQL Server storage engine, sort and hash operations are the two operations that consume memory in this way, so optimizing queries to reduce them can relieve memory pressure. If RESOURCE_SEMAPHORE waits occur frequently on an instance, SQL Server is under memory pressure.

SQL Server has an option, min memory per query, which sets the minimum amount of memory that SQL Server allocates to each query. When a query needs additional memory, this option determines the minimum it obtains, and the query starts executing only after it has been granted memory.

2. Granting the requested memory through the resource semaphore

How much memory must SQL Server grant to a query before the query can actually start executing?

  • Step 1, calculate the needed memory: SQL Server calculates how much memory each query needs to execute, which usually consists of required memory and additional memory. When the query runs in parallel, the formula is (required memory * DOP) + additional memory.
  • Step 2, determine the requested memory: SQL Server checks whether the memory needed by the query exceeds the system limit. If it does, SQL Server reduces the amount of additional memory so that the limit is not exceeded. The resulting amount is the query's requested memory.
  • Step 3, grant the requested memory: the SQL Server instance uses the resource semaphore to grant (allocate) memory to the query.

If the instance cannot grant the requested memory at that point, the query enters the RESOURCE_SEMAPHORE wait state. SQL Server maintains a first-come, first-served wait queue: a query that enters the RESOURCE_SEMAPHORE wait state is placed at the end of the queue. Once the instance finds enough free memory, it takes the first query from the head of the queue and grants its memory, after which the query starts executing. If queries on an instance spend a long time in the RESOURCE_SEMAPHORE wait state, SQL Server is facing memory pressure.
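The three steps can be sketched as a toy model. This is a simplification under stated assumptions, not SQL Server's actual grant logic: the function names, the KB units, and the single `max_grant_kb` cap are hypothetical, and the queue is treated as strictly first-come, first-served as the text describes.

```python
from collections import deque


def requested_memory_kb(required_kb, additional_kb, dop, max_grant_kb):
    """Steps 1 and 2: compute the ideal amount, then trim only the additional part."""
    required_total = required_kb * dop
    ideal = required_total + additional_kb
    if ideal <= max_grant_kb:
        return ideal
    # Only the additional memory is reduced; required memory is kept whole.
    room_for_additional = max(max_grant_kb - required_total, 0)
    return required_total + min(additional_kb, room_for_additional)


def grant_in_fifo_order(requests, free_kb):
    """Step 3: grant from the head of the queue while memory is available.

    `requests` is a list of (query_name, requested_kb). Returns the names that
    were granted and the names left waiting on RESOURCE_SEMAPHORE.
    """
    queue = deque(requests)
    granted = []
    while queue and queue[0][1] <= free_kb:
        name, kb = queue.popleft()
        free_kb -= kb
        granted.append(name)
    waiting = [name for name, _ in queue]
    return granted, waiting


print(requested_memory_kb(1024, 4096, dop=4, max_grant_kb=100_000))
print(requested_memory_kb(1024, 4096, dop=4, max_grant_kb=6_000))
print(grant_in_fifo_order([("q1", 3000), ("q2", 5000), ("q3", 1000)], free_kb=4500))
```

In the last call, q3 would fit in the remaining memory but still waits behind q2. That is why one large request at the head of the queue can stall many small queries.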

IO-related waits are tied to IO resources. For example, backup and restore operations produce the BACKUPIO and BACKUPBUFFER wait types, which occur when a backup task is waiting for data or for a data buffer.

1, asynchronous network IO (ASYNC_NETWORK_IO)

The ASYNC_NETWORK_IO wait type occurs when data cannot be written across the network to the client in time.

The result set produced by SQL Server has to be passed to the client over the network. If the result set is not transferred to the client promptly, it stays in the SQL Server session and an ASYNC_NETWORK_IO wait appears. In other words, SQL Server has the data ready, but transmission to the client cannot keep up. This wait is generally not a database problem, and adjusting the database configuration will not help much. A network-layer bottleneck is one possible cause, but first consider whether it is necessary to return so much data, and check whether the application really needs to request such a large result set from SQL Server.

This wait type is where SQL Server has sent some data to a client through TDS and is waiting for the client to acknowledge that it has consumed the data. It can also show up with transactional replication if the Log Reader Agent job is running slowly for some reason.

2, asynchronous IO completion (ASYNC_IO_COMPLETION)

ASYNC_IO_COMPLETION: occurs when a task is waiting for asynchronous IO to complete. Long ASYNC_IO_COMPLETION waits are common while SQL Server is executing database backup and restore operations (the BACKUP DATABASE and RESTORE commands). See «ASYNC_IO_COMPLETION» to learn more.

3, synchronous IO completion (IO_COMPLETION)

This wait type occurs when SQL Server is waiting for a synchronous IO operation to complete. It usually represents IO on non-data pages, that is, synchronous reads and writes of files that are independent of table data. When this wait appears, SQL Server is likely performing one of the actions listed below:

  • Reading transaction log records from a transaction log file
  • Reading non-data pages, such as allocation pages (GAM, SGAM, PFS, and so on); this often occurs during database restore, database startup, and recovery
  • Writing intermediate sort results to disk
  • Reading and writing result sets during a merge join
  • Writing a data set to disk during an eager spool

When CXPACKET waits occur and the session's data IO counters (logical reads and writes, physical reads) are not changing, the session is probably performing IO on non-data pages, for example restoring a transaction log or reading system pages (GAM). In general, there are two ways to reduce IO_COMPLETION waits: spread the IO across different physical disks, or reduce IO on non-data pages.

CXPACKET and IO_COMPLETION waits typically appear together. If sys.dm_os_waiting_tasks shows that some tasks of a session are waiting, the session is running in parallel: some tasks execute slowly and others quickly, so the fast tasks finish and wait for the ones that have not completed, while the slower tasks are performing non-data-page IO.

4, WRITELOG (writing transaction log records to disk)

This wait is tied to disk write speed. It indicates that a task is waiting for log records to be written to the log file, which suggests that disk write speed is the performance bottleneck.

Five, latch waits

A latch is a lightweight synchronization mechanism that keeps data consistent between threads when a thread reads or modifies it. SQL Server has three types of latch wait:

  • PAGEIOLATCH_xx: on a data page, occurs while the page is being read from disk into memory
  • PAGELATCH_xx: on a data page that is already in memory
  • LATCH_xx: on data structures other than pages

1, PAGEIOLATCH (disk IO on data pages)

  • PAGEIOLATCH waits fall mainly into two categories: PAGEIOLATCH_SH and PAGEIOLATCH_EX
  • PAGEIOLATCH_SH: occurs while a data page is being read from disk into the buffer pool. When a user needs a data page that is not in memory, SQL Server has to read it from disk. Frequent waits of this type mean SQL Server is performing too many page reads, which indicates that memory is too small or under pressure and cannot keep the needed pages cached. In that case memory is the bottleneck.
  • PAGEIOLATCH_EX: the user has modified a data page in memory and SQL Server needs to write it back to disk; waits of this type mean that disk writes are slow.

2, PAGELATCH (latches on in-memory data pages)

PAGELATCH synchronizes modifications to data pages in the buffer pool. When a task needs to modify a page in the buffer, it must first acquire PAGELATCH_EX; only after it holds this latch can it modify the page.

Because the modification happens in memory, each one should be very short, and the PAGELATCH is held only for the duration of the modification. If PAGELATCH waits show up, a large number of concurrent statements are modifying a table and the modifications are concentrated on the same page or on a small number of pages, which are called hot pages. Hot pages are caused by data that is too concentrated; distributing the data across different files can reduce PAGELATCH waits.

3, LATCH waits

LATCH waits occur on non-page structures, when a thread cannot access a data structure because another thread holds it in exclusive mode.

Non-page latch waits correspond to LATCH_SH and LATCH_EX, and sys.dm_os_wait_stats does not show exactly which latch is the contention point. You can view real-time latch waits in the sys.dm_os_waiting_tasks DMV, which displays the name of the latch being waited on. Alternatively, you can query the sys.dm_os_latch_stats DMV to see which latch classes have accumulated the most wait time.

Reference documentation: Most common latch classes and what they mean

4, FGCB_ADD_REMOVE latch

5, ACCESS_METHODS_xx (access methods)

ACCESS_METHODS_DATASET_PARENT and ACCESS_METHODS_SCAN_RANGE_GENERATOR: during a parallel scan, these two latches are used to hand each thread the ranges of page IDs it should scan. Waits on these two latches suggest a large number of scan and hash operations.

ACCESS_METHODS_HOBT_COUNT: this latch is used to flush page and row count deltas for a HoBt (heap or B-tree) to the storage engine metadata tables. Contention indicates many small, concurrent DML operations on a single table.

ACCESS_METHODS_HOBT_VIRTUAL_ROOT: this latch is used to access the metadata that holds the page ID of an index's root page. When the root page of a B-tree splits (which requires the latch in EX mode), any thread that wants to navigate down the B-tree (which requires the latch in SH mode) must wait, and latch contention occurs. This wait may be caused by many concurrent connections working on a small index, or by page splits from random key values cascading from the leaf level up to the root. For this wait, check whether an index has a large number of page splits.

CXPACKET waits: in a parallel plan, a child thread is waiting for the other child threads to finish.

SOS_SCHEDULER_YIELD waits: the current task is executing but voluntarily gives up its time slice on the scheduler so that other tasks can run.

THREADPOOL waits: the query is waiting for an available worker thread; a large number of parallel plans may be using up the worker threads available to the system.
