Manual update of the statistics

Low a sampling rate for Update Statistics can have the effect of causing the optimizer to choose a bad plan that would lead to an inaccurate creation of the statistics. Changing to Update Statistics with Full Scan solves the issue.

However, full scan statistics update can be slow and expensive on a large system, and a single update statistics job for a whole database can easily exceed an overnight maintenance window. This gave me the idea to create jobs that could update statistics on the system in parallel, as follows.

The idea here is to make all subsystems work on your server to update as many statistics objects as possible, within a defined maintenance window, in order from the worst cases to the best. The hope is that the process can run nightly, and will chip away at all the indexes and statistics objects in order. If the maintenance window runs out, it’ll stop and not interfere with the server’s performance during the day. When the next evening rolls around, it will begin again, but will work on the worst cases first, which hopefully will include any objects for which there was not time to update statistics the previous night. On a large system, the statistics should at least be updated every few nights, in a round-robin fashion.

The design consists of a collection of SQL Agent jobs: one is a controller” that starts and stops the process, and the others are “workers” that perform the updates against collections of indexes. The controller establishes a queue of objects to be updated, and the workers pull items from that queue until they run out or the controller stops the process at the end of the maintenance time period.

Step 1: Establish a Set of Indexes to Update

To implement this pattern, we begin with a select statement that contains all the indexes and statistics objects needed, with data about their status and size:

select
schemas.name as table_schema,
tbls.name as table_name,
i.name as index_name,
i.id as table_id,
i.indid as index_id,
i.groupid,
i.rowmodctr as modifiedRows,
(select max(rowcnt) from sysindexes i2 where i.id = i2.id and i2.indid < 2) as rowcnt,
stats_date( i.id, i.indid ) as lastStatsUpdate,
‘False’ as Processed
into ##updateStatsQueue
from sysindexes i
inner join sysobjects tbls on i.id = tbls.id
inner join sysusers schemas on tbls.uid = schemas.uid
inner join information_schema.tables tl
on tbls.name = tl.table_name
and schemas.name = tl.table_schema
and tl.table_type=’BASE TABLE’
where 0 < i.indid and i.indid < 255
and table_schema <> ’sys’
and i.rowmodctr <> 0
and (select max(rowcnt) from sysindexes i2 where i.id = i2.id and i2.indid < 2) > 0

This statement will query mainly the sysindexes table to create a list of all indexes and statistics objects, for non-system tables, with each object’s

1)  Last statistics update time

2) Number of rows in the underlying table

3) Number of rows modified since the last statistics update.

The results are restricted only to those objects that have changed since the last statistics update, by choosing only rows where sysindexes.rowmodcounter is nonzero. (Note that this value can be negative in some cases, so I am deliberately using “<> 0.”)

There is one tricky bit to this: objects that are strictly statistics — as opposed to indexes — do appear in sysindexes, but always with a row count (sysindexes.rowcnt) of zero; sysindexes does not report the row count in the underlying table. In order to prioritize the queue I am generating, I would like to be able to compute how many modified rows there are out of the total number of rows in the underlying table, that is rowmodctr / rowcnt. However, we’d get either nonsense or a divide by zero error for objects that are statistics-only, since rowcnt is always zero. For that reason, I have this subquery:

select max(rowcnt) from sysindexes i2 where i.id = i2.id and i2.indid < 2

This will transform the row count for the statistics objects into the row count for the underlying table.

After the select statement runs, we have a global temporary table containing a list of indexes to update. It’s important that the table have global scope, because the controller job is going to need to share it with the worker jobs actually performing the updates. Next, I “wrap” this select with some other code to control how long this process is allowed to run, and to start and stop our “worker” jobs, and put the code into a SQL Agent job:

declare @starttime datetime;
set @starttime = getdate();

– Number of minutes to continue updating stats
– Routine will complete its current operation and exit
– if more than this timespan has elapsed:
declare @maxDur decimal (5,2);
set @maxdur = 480;

print ‘Updating statistics in ‘ + DB_NAME();
print ”;

set nocount on;

– Make the queue of indexes to update:

if object_id(’tempdb..##updateStatsQueue’) is not null drop table ##updateStatsQueue;

select
schemas.name as table_schema,
tbls.name as table_name,
i.name as index_name,
i.id as table_id,
i.indid as index_id,
i.groupid,
i.rowmodctr as modifiedRows,
(select max(rowcnt) from sysindexes i2 where i.id = i2.id and i2.indid < 2) as rowcnt,
stats_date( i.id, i.indid ) as lastStatsUpdate,
‘False’ as Processed
into ##updateStatsQueue
from sysindexes i
inner join sysobjects tbls on i.id = tbls.id
inner join sysusers schemas on tbls.uid = schemas.uid
inner join information_schema.tables tl
on tbls.name = tl.table_name
and schemas.name = tl.table_schema
and tl.table_type=’BASE TABLE’
where 0 < i.indid and i.indid < 255
and table_schema <> ’sys’
and i.rowmodctr <> 0
and (select max(rowcnt) from sysindexes i2 where i.id = i2.id and i2.indid < 2) > 0;

– Start worker jobs to process the queue

print ‘Starting worker jobs’;

– Change the job names below as needed for implementation:
exec msdb..sp_start_job @job_name = ‘Update Stats Worker 1′;
exec msdb..sp_start_job @job_name = ‘Update Stats Worker 2′;
exec msdb..sp_start_job @job_name = ‘Update Stats Worker 3′;
exec msdb..sp_start_job @job_name = ‘Update Stats Worker 4′;

– Monitor the queue; remove its temp table and exit once the entire table
– has been processed or the max duration has elapsed:

while ( datediff( minute, @starttime, getdate() ) < @maxdur
and exists( select 1 from ##updateStatsQueue where processed = ‘False’ ) )
begin
– Pause while worker jobs update stats
print ‘Working ‘ + cast( getdate() as varchar(50) );
waitfor delay ‘000:02:00′;
end

drop table ##updateStatsQueue;

print ”;
print ‘Statistics update stopped/ended ‘ + cast( getdate() as varchar(50) );

This will create the queue as a global temp table, then start four other jobs that will perform the work. When either the maximum duration has elapsed or the workers have processed the whole queue, it will drop the temp table, which will cause the worker jobs to stop and exit.

Step 2: Code the Worker Jobs to Prioritize Updates

Next, I create four worker jobs, each with code similar to the following. The worker job works in a loop: it will first “claim” the top item from the queue table according to criteria we can customize. In this case I am mainly looking at objects where the number of modifications since the last stats update is a high percentage of the total number of rows in the underlying table. It will then compose a dynamic SQL statement that updates the statistics, after which it repeats until either all the objects in the queue have been processed, or the controller job has dropped the queue table, indicating that work should cease:

– Process statistics objects from queue ##updateStatsQueue
– until it has been completely processed or no longer exists

print ‘Updating statistics in ‘ + DB_NAME()
print ”

declare @curtable sysname;
declare @curschema sysname;
declare @curindex sysname;
declare @curmodrows int;
declare @currows bigint;
declare @sql varchar(2000);

– Loop until we run out of stats objects or the queue is dropped
while ( object_id(’tempdb..##updateStatsQueue’) is not null )
begin

– Get the top stats object to work on, and remove its entry from the queue

update ##updateStatsQueue
set @curschema = table_schema = q.table_schema,
@curtable = table_name = q.table_name,
@curindex = index_name = q.index_name,
@currows = rowcnt = q.rowcnt,
@curmodrows = modifiedRows = q.modifiedRows,
processed = ‘True’
from ##updateStatsQueue q
inner join ( select top 1 *
from ##updateStatsQueue q3
where processed = ‘False’

– Note: You may wish to add criteria or change the order clause here to
– tune the ordering of statistics updates for your specific server:
– For example: “and groupid = (x)” or “and table_name like ‘[a-g]%’”
– The ordering given is from tables with a large percentage modified rows
– to tables with a small percentage modified rows (preventing divide by zero)

order by abs(modifiedrows)/( cast( rowcnt as decimal ) + 0.01 ) desc, lastStatsUpdate
) q2
on q.table_schema = q2.table_schema
and q.table_name = q2.table_name
and q.index_name = q2.index_name;

– If the update statement found no matching rows, then our work is done; exit:
if @@ROWCOUNT = 0 break;

else
begin
print cast( getdate() as varchar(100) );
print ‘Updating stats for ‘
+ ‘[' + @curschema + '].[' + @curtable + '].[' + @curindex + ']‘
+ ‘(’ + cast( @curmodrows as varchar(1000) ) + ‘ modifications against ‘
+ cast( @currows as varchar(1000) ) + ‘ rows.)’;

set @sql = ‘update statistics ‘
+ ‘[' + @curschema + '].[' + @curtable + '] [' + @curindex + ']‘
+ ‘ with fullscan ‘

exec( @sql );

end
end

print ”
Print ‘Statistics update stopped/ended ‘ + cast( getdate() as varchar(50) );

In this case the only fancy bit is the syntax of the update statement against the queue table: it sets local variables for the name of the object that the worker will be updating and marks the row in the queue table so other workers will not use it, all in one statement, to prevent contention for or deadlocks on the queue table.

Step 3: Schedule the Jobs

Finally, these jobs need a schedule. Only the “controller” job should be executed on a schedule; it will start the worker jobs after the queue table is created, and the workers will stop on their own when they have no more queue entries to process. For this reason, I have a note in the description of the worker jobs to the effect, “This job is deliberately unscheduled and is invoked by job ‘x’.” (I hope that will prevent a well-intentioned teammate from adding a schedule to the worker jobs, which would not technically cause problems but could be confusing.) I also set the Agent jobs to write output to a log file, so that I can review the output of the Print statements in the jobs and ensure that everything is running as expected.

Note that the controller job has a “maximum duration” variable that can be adjusted to fit the maintenance window for a system. No new statistics updates will be started after that time has elapsed, but worker jobs that are in the middle of an update will run it to completion before they stop. A long-running update that starts right at the end of the maintenance window will keep running, so take that into account when setting your schedule. Note that the largest tables are not necessarily processed first, the most heavily modified tables are, so there is likely to be a large table at the end of the list. This is even more true in that, percentage-wise, a large table might be less likely to have a large percentage of modified rows. For best results time the controller job to stop before the end of your maintenance window, by some interval that will allow the workers to complete the update they are currently executing.

Notes on Performance and Older Versions of SQL Server

Its important to note that this procedure, right out of the box, might not perform better than a “single-threaded” version on all systems, without some tweaks. If you consider implementing this, be sure to think through these issues:

* Does your system have fast enough storage infrastructure and enough memory so that parallel statistics updates will better utilize the hardware than serial updates? It could be that contention for disk or memory makes this solution less optimal.
* Would it help to alter the ordering of updates, for example to isolate objects from one filegroup using one worker and from another filegroup, on different disks, using another worker?
* Are multiple workers updating statistics on the same underlying tables at the same time? I believe that SQL Server will serialize statistics updates from different connections that hit the same table, so simultaneous updates might be faster or might be slower depending on the details of how it does that, and what advantage there is from having the table in cache as opposed to re-fetching it from disk.

If you are running a version of SQL Server 2000 SP3a or earlier, there is a known issue where concurrent statistics updates against the same underlying table can deadlock on the sysindexes table. This will have the effect that some of the worker jobs will be killed as a result of deadlock detection, leaving just one (one worker cannot deadlock against itself). For this case, consider SP4 or change the code that the worker jobs use to select from the queue to make them pick indexes on mutually exclusive tables (for example, select for tables named like ‘[a-g]%’ for one worker and ‘[h-m]%’ for the next, and so on. See http://support.microsoft.com/kb/826754.

Managing outdated statistics on very small tables

What neither SQL Server 2000 nor SQL Server 2005 solve yet is to manage outdated or obsolete statistics on tiny tables (in terms of number of rows).

This is much easier to handle with huge tables. If you add one row to a 1-million row table it will very likely not matter regarding the data distribution. But if you add 1 row to a 1-row table you double the data.

The key question is: how often should the statistics be updated ? If it happens too often (e.g. with every new single insert, update or delete) it might thrash the whole system.

Schedule autoupdate statistics

I’d recommend the following:

1) Turn autoupdates off (they cause perf overhead) and opening up the Fill Factor in the index from 80% (default) to 50% or lower, which will leave more room to grow the index.

2) Schedule autoupdate statistics runs cyclicly as needed.

Auto Stats may corrupt Statistics and decrease performance

Another reason to turn off Auto Update Statistics

We have found (beyond a reasonable doubt - even at a certain company’s internal SQLServer databases) that statistics for SQLServer seem to get corrupt from the optimizers point of view over time.

One theory is that auto stats may contribute to that. It’s gotten so bad on some of our databases that we have to update stats every hour in order to be sure we ‘fix’ the situation as soon as possible after it happens. We plan on testing the difference with auto statistics on vs off on our databases and see what kind of performance differences we get and if there is any noticeable drop in corrupted statistics.

Use the trace flag 8721 to determine how often that AUTOSTATS runs

One way to determine how often that AUTOSTATS runs in SQL runs is to use the trace flag 8721. When turned on, it outputs to the errorlog every time AUTOSTATS runs. This information can be used to help determine if AUTOSTATS is running too often or not. Be sure to turn off this trace flag when you are done with it, as it can produce unnecessary overhead.

Find if Statistics is missing or is outdated

One way to tell if the statistics for a column are missing or outdated, or if the accuracy of current statistics is not ideal (they may not be ideal if the sample size of the statistics are not high enough), is to run a graphical execution plan of a query as an estimated plan and as a real plan. In other words, you will run both an estimated and an actual graphical execution plan using Query Analyzer, and then compare the results.

When you run an estimated graphical execution plan for a query, you get results, such as the estimated row count returned. When you run an actual graphical execution plan for a query, you get the actual row count returned. If the statistics for the column(s) used by the query are current and accurate, then the estimated row count and the actual row count should be very similar. If they are wildly different, this can mean three things. Either the statistics are missing, out of date, or not accurate enough.

First, check to see if the statistics are missing. If they are, they can easily be added.

Second, if statistics do exist, then update them to see if this resolves the problem.

Third, if the first two suggestions don’t work, then the statistics may not be accurate enough.

By default, when a table is less than 8 MB, the sample used to create the column statistics is the entire table. So for tables of less than 8 MB, in theory the statistics created can’t be any better than they are. So if your table is less than 8 MB, and updating the statistics doesn’t resolve the issue of a large difference between the estimated and actual row count, then nothing can. In a case like this, your only option, assuming there is a performance problem, is to use a hint to override what the Query Optimizer is doing wrong.

But if the table in question is greater than 8 MB, and when SQL Server automatically creates statistics for a table, it does not examine every row. Instead, it takes a sample of the available rows. This sample may or may not be good enough to produce accurate enough statistics. One way to get better statistics is to manually update the statistics for the table using the UPDATE STATISTICs command. This command has two options that allow you to get a better sampling of the rows in your table.

First, you can use the FULLSCAN option. This will force every row in the table or index to be fully examined and used to build the column statistics. As you might guess, using this option can increase the time it takes to update the statistics, which could hurt performance elsewhere on your server, especially if the table is huge.

As an alternative to the FULLSCAN option, you can use the SAMPLE option. This allows you to control how large a sample of rows to sample. For example, you might specify that only 50 percent of the rows are to be sampled. This is less than what a FULLSCAN will produce, but more than what a default sample will produce, offering a compromise in accuracy and resources needed to perform the sample.

If you find that the statistics for one or more of our tables is not accurate enough for your needs, you may need to schedule running the UPDATE STATISTICS command at regular intervals using an appropriate sampling level. Only through testing will you know what sampling level is best for your situation.

You should update statistics for all tables and for all databases every night.

According to a SQL Server tuning expert at Microsoft (one of their best), it was suggested to me that if your SQL Server maintenance window allows for it, that you should update statistics for all tables and for all databases every night.

This is because the auto-update option is less than perfect, and accurate and up-to-date statistics is so important to overall performance of SQL Server. Of course, if this is not possible because of time constraints, then you will have to perform this step less often, or even not at all, and rely on the auto-update feature. But if you do have this window, you should take advantage of it.

Verify that statistics is updated - first step in query performance optimization

If you have a performance issue with a query, the very first step you should take before analyzing it is to update the statistics on all the tables in your database.

This is because many query performance issues are due to outdated statistics, and updating them before troubleshooting query performance may save you a lot of time.

If the query is still giving you trouble after updating the statistics, then you can begin your troubleshooting.

DBCC SHOW_STATISTICS command (find out if an index has had its indexes updated or not)

If you want to find out if an index has had its indexes updated or not, you can use the DBCC SHOW_STATISTICS command, like this:

DBCC SHOW_STATISTICS (table_name , index_name)

This command will not only tell you when statistics were last updated, but also provide you information on the kind of statistics that has been collected for the index you are examining.

Statistics are only maintained on the first column of a composite index.

Statistics are only maintained on the first column of a composite index. Because of this, SQL Server’s Query Optimizer might not make as good use of composite indexes as it could. If you suspect that a composite index is not being used as you expected it would be, you will want first to verify that the index is not being used by examining the query’s execution plan.

If you see that it is not being used, and it should be, consider adding a hint to force the index to be used, then see if using the index really helps performance. If it does, keep the hint. If it does not help performance, drop the hint and the composite index that is not being used, assuming that it is not needed by another query.