Search This Blog

Wednesday, April 27, 2011

Finding the optimal number of CPUs for long running CPU intensive queries (except OLTP-like queries)

Hi Guys,

Hope this article helps you in one way or another some day :) .....

This short article applies to finding the optimal number of CPUs for long running, CPU intensive queries/workloads that do not frequently wait on other resources. It does not apply if your queries/workload often wait on resources (I/Os, locks, latches, etc.) without consuming CPU in a stretch. It can also reveal uneven CPU load across NUMA nodes and uneven CPU load within the same NUMA node (the load_factor effect).
It is recommended to analyze Windows Performance Monitor counters when monitoring CPU pressure; processor utilization consistently above 75% to 80% indicates CPU pressure. Windows Performance Monitor should be the first step, and the procedure suggested in this article should be considered an additional step.
Further, it is very important to find ways to optimize the queries/workload by tuning the database schema before attempting to add additional CPUs.

When a customer asks you: "I am running a resource consuming SQL job and it takes x amount of time. How can I reduce that time so the job completes sooner? Can I add more CPUs? If yes, how many?"
When you see CPU pressure, there are two options: you can either upgrade to faster CPUs or add additional CPUs [assuming the queries are well tuned and the schema is normalized]. Upgrading to faster CPUs will always help. Adding additional CPUs may not help the SQL job run faster unless that job can take advantage of them [read about Max Degree of Parallelism in BOL]. If the customer already has the fastest CPUs available on the market, they have to wait for the next release of faster CPUs. One more choice would be to add additional CPUs and see if it helps; the procedure below will help you identify whether this is the case.
This method calculates the total user waits for CPU during the SQL workload and suggests additional CPUs if necessary. If CPU usage is at 100% but no task waited for CPU during the workload, then adding CPUs will not help; that is the basic idea behind this calculation.
Current recommendations on this topic calculate the 'signal wait time' to 'wait time' ratio to indicate CPU pressure – but that ratio cannot easily tell you the number of additional CPUs necessary.
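For reference, the commonly cited ratio check looks like this (a minimal sketch; the threshold people quote for "pressure" varies by source):

```sql
-- ratio of signal wait time (time a task spent waiting for CPU after its
-- resource became available) to total wait time, across all wait types
select cast(100.0 * sum(signal_wait_time_ms) / sum(wait_time_ms)
            as decimal(5, 2)) as signal_wait_pct
from sys.dm_os_wait_stats
where wait_time_ms > 0
```

A high percentage suggests CPU pressure, but as noted above it does not translate into a CPU count.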

When concurrent users apply a simultaneous CPU intensive workload, there can be CPU pressure. We can conclude that CPU pressure is present when, at any given moment during this period, at least one user task waited for the CPU resource.
In this case one can run the query below to find out how many additional CPUs, on average, would help the workload scale better. It is more informative to collect this information in short intervals (many samples) rather than just once, so you can understand at which point of the workload the CPU pressure was highest. A single sample yields only the average number of additional CPUs necessary over the entire workload duration.
1. Reset Wait Stats
dbcc sqlperf('sys.dm_os_wait_stats', clear)
2. Apply workload (you can find sample workload query at the end of this article, you need to execute the sample workload query simultaneously in many sessions to simulate concurrent user tasks).
3. Run the below query to find the Additional CPUs Necessary – it is important to run it right after the workload completes to get reliable information.

select round(((convert(float, ws.wait_time_ms) / ws.waiting_tasks_count)
           / (convert(float, si.os_quantum) / si.cpu_ticks_in_ms)
           * si.cpu_count), 2) as Additional_CPUs_Necessary,
       round((((convert(float, ws.wait_time_ms) / ws.waiting_tasks_count)
           / (convert(float, si.os_quantum) / si.cpu_ticks_in_ms)
           * si.cpu_count) / si.hyperthread_ratio), 2) as Additional_Sockets_Necessary
from sys.dm_os_wait_stats ws
cross apply sys.dm_os_sys_info si
where ws.wait_type = 'SOS_SCHEDULER_YIELD'
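To collect samples in short intervals as suggested above, one possible sketch (the temp table name, the 30-second interval, and the 10-sample count are arbitrary choices – adjust them to your workload duration):

```sql
-- snapshot the cumulative SOS_SCHEDULER_YIELD counters every 30 seconds;
-- the difference between consecutive rows gives the per-interval waits
create table #wait_samples
    (sample_time datetime, wait_time_ms bigint, waiting_tasks_count bigint)

declare @n int
set @n = 1
while @n <= 10
begin
    insert into #wait_samples
    select getdate(), wait_time_ms, waiting_tasks_count
    from sys.dm_os_wait_stats
    where wait_type = 'SOS_SCHEDULER_YIELD'
    waitfor delay '00:00:30'
    set @n = @n + 1
end

select * from #wait_samples order by sample_time
```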

When you have 2 CPUs and run the sample workload with just 1 or 2 concurrent sessions, you will see no recommendation for additional CPUs – unless user tasks are unevenly distributed across the CPUs. On the other hand, if you run the workload with 4 concurrent sessions, the query suggests adding 2 additional CPUs; with 6 concurrent sessions, it suggests adding 4.
If each workload runs in parallel (MAXDOP not 1), you will notice an additional CPU suggestion, and you need to be careful in this case. For example, with 2 CPUs, when you run the workload (in parallel, MAXDOP 0/2) with 2 concurrent sessions, you will see a suggestion to add 2 additional CPUs. This just indicates the workload is more scalable with more CPUs – parallel query execution, as you can imagine, can consume as many CPUs as you have, and more!
The results are not reliable when other applications are running on the system. The results might also be incorrect on a hyper-threading enabled system.

When more user tasks concurrently need CPU than there are CPUs available, the excess user tasks wait for CPU (with exceptions when the workload is not evenly distributed across CPUs). In this case each user task uses its quantum and then goes into a wait state with wait_type SOS_SCHEDULER_YIELD until all other runnable user tasks have used their quantum. (sys.dm_exec_requests doesn't show this wait type, probably by design, to avoid showing user tasks as waiting when they are only waiting for CPU; sys.dm_os_wait_stats does include these waits.) If one measures how many tasks went into this wait state, and for how long, while the workload was applied, it is possible to calculate the number of CPUs necessary to scale the workload better.
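As a worked example with made-up numbers (the figures below are illustrative, not measured): suppose 20,000 yields accumulated 80,000 ms of SOS_SCHEDULER_YIELD waiting on a 2-CPU box with the usual ~4 ms quantum. The average wait per yield is 4 ms – one full quantum – meaning on average one extra task was always waiting behind each CPU:

```sql
declare @wait_time_ms float
declare @waiting_tasks_count float
declare @quantum_ms float
declare @cpu_count int
set @wait_time_ms = 80000         -- total SOS_SCHEDULER_YIELD wait time
set @waiting_tasks_count = 20000  -- number of yields (waiting_tasks_count)
set @quantum_ms = 4               -- os_quantum / cpu_ticks_in_ms, typically ~4 ms
set @cpu_count = 2
-- (80000 / 20000) / 4 * 2 = 2 additional CPUs
select round((@wait_time_ms / @waiting_tasks_count)
             / @quantum_ms * @cpu_count, 2) as Additional_CPUs_Necessary
```

This is the same arithmetic the query above performs, just with constants substituted for the DMV columns.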
runnable_tasks_count from sys.dm_os_schedulers is also an indication of CPU pressure, but it is just a probe – one cannot reasonably conclude from it the number of CPUs necessary for a given workload.
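A quick look at that probe (visible user schedulers only; a runnable_tasks_count that stays above zero across repeated samples hints at CPU pressure):

```sql
select scheduler_id, cpu_id, current_tasks_count, runnable_tasks_count
from sys.dm_os_schedulers
where scheduler_id < 255   -- exclude hidden/internal schedulers
```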

There is an exception (for OLTP-like workloads) where a user task doesn't consume its full quantum in a stretch (it goes into some other wait state before the quantum expires, waiting for I/Os, locks, latches, etc.), but keeps running in a loop using CPU without ever using a full quantum (you know what a quantum is ... right :D). The method described here cannot calculate the necessary additional CPUs in this case. The most common example is short transactions that use part of a quantum, wait on WRITELOG, and continue in a loop – inserts using implicit transactions in a loop are a typical example.

Sample Workload:
Create and populate the below table before running the workload queries.

create table tab7 (c1 int primary key clustered, c2 int, c3 char(2000))

begin tran
declare @i int
set @i = 1
while @i <= 5000
begin
    insert into tab7 values (@i, @i, 'a')
    set @i = @i + 1
end
commit tran

Serial Workload:
select max(t1.c2 + t2.c2) from tab7 t1 cross join tab7 t2 option (maxdop 1)

Parallel Workload:
select max(t1.c2 + t2.c2) from tab7 t1 cross join tab7 t2

Happy Learning !!!