jump to navigation

SQL Server 2005 I/O issues – Part 1 November 3, 2008

Posted by msrviking in Performance tuning.
trackback

I had this issue presented to me by one of my customers. Alright what is the issue? – “During peak hours of the day the performance of the application slows down.” With this single statement and few other pointers on the application environment I was requested to hunt down the problem from a database perspective.

As I started looking into what could be the cause of performance issues “if any” on the database server, I was getting multiple ideas, but was unable to pin-point what could be the cause, where and what to look for.

So I requested my system administrators to help me in finding what is alarming – CPU, Memory or Disk subsystem on the box. If I know where the cause through the performance counter is, I will know what kind of queries should be run against the instance DMV’s. My request was probably the toughest with no indicators around still, I went and took the armory I use for finding these type issues. Before I used the armory, I categorized the problematic areas as

  • CPU
  • Memory
  • Disk Subsystem

I ran around 7 out of 13 emergency diagnostic queries (some of these queries are from Glenn Berry’s list, some from the Troubleshooting Performance in SQL Server 2005, and some from Working with tempdb in SQL Server 2005) and here is what are all these about, what did yield and what could we do further.

My other thoughts were to wait for reports from Windows front to conclude further on what needs to be done. I was also thinking to have the performance counters for CPU, Disk Reads, Writes, RAM to be configured for logging so that we could arrive at better conclusions.

My findings and inferences were as below

— HIGH CPU *******

— Isolate top waits for server instance

What are the results?

There is one wait type by SQLTRACE_BUFFER_FLUSH, wait time is 2037408, and running percentage 97.45.

What do we infer?

Occurs when the SQL Trace flush task pauses between flushes. This wait is expected and long waits do not indicate a problem.

What do we do?

One thing that is making me wonder is why do we see this kind of wait type. I haven’t thoroughly analyzed on this part, and I shall get back with more inputs in case if it’s going to be worth enough.

— HIGH CPU *******

— Check SQL Server Schedulers to see if they are waiting on CPU (runnable tasks above 10 is very bad)

What are the results?

None of the cpus were busy when I ran this query. The values of runnable tasks were zero.

What do we infer?

The run of this script was not at the right hour, and the information gathered at off peak hours (AM hours) wasn’t judgeable.


What do we do?

We should probably run the script during the peak hours for ½ to find if any of the cpu cores is facing issues.

— TRANSACTION LOG FULL *****

— Find the log reuse description for the transaction log

What are the results?

None of the transaction logs of the user and tempdb were waiting for any operations to complete, when I ran this query. The values of run able tasks were NOTHING.

What do we infer?

The run of this script was not at the right hour, and the information gathered at off peak hours (AM hours) wasn’t judgeable.

What do we do?

We should probably run the script during the peak hours for ½ to find if any there are performance issue faced by transaction logs to write the data.

— IO ISSUES *****************

— Analyze DB IO, ranked by IO Stall %

This query returned the most interesting results.

What are the results?

The query returned parameters like which database and which file of the database is having issues. The most prominent database that had high stall value and percentage was tempdb, and the log file being the most victimized.

File type io_mb io_stall_s io_stall_pct

log 12063.21 14488.00 23.76

data 69810.41 13228.62 21.69

What do we infer?

The query returns a value called io_stall which means “Total time, in milliseconds, that users waited for I/O to be completed on the file”. The value we get in the above results is in seconds. I have the feeling that tempdb log file is facing issues while writing data.

This kind of stalling would happen when the configuration of the log file is not optimal to perform the kind of operations generated by the application calls or the disk subsystem is tiring up doing the operations of tempdb log file.

What do we do?

My advice was to monitor tempdb utilization, and monitor the disk subsystem (drive counters) where the tempdb is configured for sometime before we make any changes. We don’t have any threshold values defined to say if the above numbers are concerning are not, but further monitoring would help to deduct further.

As I was writing this post, I thought I should keep this as Scene – I of the Act – I and will shortly post on Scene – II of the Act – I.

Until then enjoy reading, and let me know if you have other ways to find out such issues.

Advertisements

Comments»

1. SQL Server 2005 I/O issues - Part 2 « Viking’s Weblog - November 4, 2008

[…] Comments (RSS) « SQL Server 2005 I/O issues – Part 1 […]

2. Paul - April 29, 2010

I have a userDB with a very high io_stall_pct, ~70%, what can be done to alleviate this ? with tempdb, I would add additional devices. All DB datafiles are on a RAID 0+1 SAN, configured as separate logical drives for data and log, comm with the infrastructure group stinks, so not many changes can be made there.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: