Fotia Ltd

Monday, 20 November 2006

The TABLESAMPLE Clause

posted by Stefan Delmarco

When I first came across TABLESAMPLE I had visions of some advanced index hints that allowed fine-grained control over table statistics (distribution, ranges, density, etc.). Unfortunately not. As per Occam's Razor, the simplest explanation is the correct one: it just... returns a sample from the specified table...?

The complete syntax for TABLESAMPLE is as follows (from BOL):

TABLESAMPLE [SYSTEM] ( sample_number [ PERCENT | ROWS ] )
    [ REPEATABLE ( repeat_seed ) ]

Given this syntax, the most obvious query with which to test-drive TABLESAMPLE would be one that returns, say, 10 rows from the Sales.SalesOrderHeader table. After all, this syntax looks a lot like TOP:

select SalesOrderId, OrderDate, TotalDue
from Sales.SalesOrderHeader tablesample system (10 rows);

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------

(0 row(s) affected)

How unexpected is that! Zero rows returned? Huh? To add to the confusion, change the query to ask for 1,000 rows as follows:

select SalesOrderId, OrderDate, TotalDue
from Sales.SalesOrderHeader tablesample system (1000 rows);

Then run the query a few times and observe how many rows get returned:

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------
44137 2001-09-01 00:00:00.000 3953.9884
...

(938 row(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------
44407 2001-10-16 00:00:00.000 3729.364
...

(1263 row(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------
44757 2001-12-01 00:00:00.000 49953.7086
...

(1085 row(s) affected)

Each execution returns a different number of rows, some just below 1,000 and others just above. Clearly there is a non-deterministic factor at play here: without any data changing, re-running the identical query keeps giving different results. Digging into the BOL documentation shows that the TABLESAMPLE algorithm has a very strong statistical slant to it. It works as follows:

For the table affected by the TABLESAMPLE, SQL Server needs to determine the probability that a page of that table's data will be sampled. This probability is equal to the PERCENT specified in the TABLESAMPLE. Alternatively, if a number of ROWS was specified, that number is divided by the total number of rows in the table to calculate the probability. In our case, the Sales.SalesOrderHeader table has 31,465 rows and we asked for 1,000 rows in the second query, so the probability works out as 1,000 / 31,465 ≈ 3.2%.
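
You can check this arithmetic with a quick throwaway query (just a sketch; the 1,000 figure matches the ROWS request above):

-- Calculate the page-sampling probability for a 1,000-row request.
select cast(1000 as float) / count(*) as samplingProbability
from Sales.SalesOrderHeader;
-- Returns approximately 0.0318, i.e. about 3.2%.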

Once this probability has been calculated, SQL Server will evaluate it for every page of the table. The details of the algorithm aren't documented, but it is along the lines of the following (a rough simulation appears after the list):

  • For each page, a random number between 0 and 1 is generated (e.g. RAND()).
  • If the random number is greater than the probability, SQL Server will pretend that the page for the table does not exist (i.e. it is not sampled / 'out-of-scope').
  • Otherwise, SQL Server considers all of the rows on the page sampled / 'in scope'.
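
To make this coin toss concrete, here is a minimal T-SQL simulation of the algorithm as described above. It is purely illustrative: the 700-page count and the 45 rows-per-page figure are assumptions chosen to roughly match Sales.SalesOrderHeader, not values SQL Server exposes.

-- Simulate the per-page sampling decision (illustration only).
declare @pageCount int, @probability float;
set @pageCount = 700;               -- assumed: ~31,465 rows / ~45 rows per page
set @probability = 1000.0 / 31465;  -- the ~3.2% calculated above

declare @page int, @sampledRows int;
set @page = 0;
set @sampledRows = 0;

while @page < @pageCount begin
    -- One 'coin toss' per page; a hit samples every row on the page.
    if rand() <= @probability
        set @sampledRows = @sampledRows + 45;
    set @page = @page + 1;
end;

raiserror('Simulated sample size: %d rows', 0, 1, @sampledRows);

Running this a few times produces totals that straddle 1,000, just like the real query did.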

This is very revealing as it explains the behaviour we saw with the first query. We asked for a sample of 10 rows, so SQL Server calculated the probability of a page being sampled as 10 / 31,465 = 0.032%, a very small number. If we execute the query once, there is a good chance that no page's random number will fall within this small probability. However, we can increase our chances of sampling at least one page by executing the select statement a number of times:

declare @loopCount int;
set @loopCount = 0;

declare @rowCount int;
set @rowCount = 0;

-- Keep looping till we manage to sample at least one page.
while @rowCount = 0 begin
    select SalesOrderId, OrderDate, TotalDue
    from Sales.SalesOrderHeader tablesample system (10 rows);

    -- @@rowcount reflects the select immediately above.
    set @rowCount = @@rowcount;
    set @loopCount = @loopCount + 1;
end;

raiserror('Required %d loops to return %d rows', 0, 1, @loopCount, @rowCount);

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------

(0 row(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------

(0 row(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------

(0 row(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------

(0 row(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------
69998 2004-05-06 00:00:00.000 2649.8453
...

(46 row(s) affected)

Required 5 loops to return 46 rows

Interestingly enough, when we finally get a probability hit we get 46 rows returned instead of 10. This reveals another facet of the TABLESAMPLE algorithm: the probability calculation only decides whether a page is sampled. Once SQL Server has decided that a page of the table is to be included in the sample, it will not limit the number of rows sampled from that page. Therefore you are always liable to get back roughly the page size divided by the row width (assuming, of course, that the pages are densely packed). Running the above query repeatedly returns 40-45 rows per hit, indicating that a page of the Sales.SalesOrderHeader table typically holds 40-45 rows.
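
One way to sanity-check the rows-per-page figure is to ask the storage engine directly. This sketch assumes SQL Server 2005's sys.dm_db_index_physical_stats DMV and should be run from the AdventureWorks database:

-- Estimate rows per page on the clustered index's leaf level.
select page_count, record_count,
       cast(record_count as float) / page_count as avgRowsPerPage
from sys.dm_db_index_physical_stats(
    db_id('AdventureWorks'),
    object_id('Sales.SalesOrderHeader'),
    1,          -- clustered index
    null,       -- all partitions
    'DETAILED')
where index_level = 0;  -- leaf (data) pages only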

A couple of additional points to note:

  • The TABLESAMPLE is applied before any WHERE clauses are evaluated.
  • This includes any JOIN conditions as well. Some misleading results can be obtained if two joined tables each have a TABLESAMPLE applied to them. As the sampling of the two tables is independent and uncorrelated, there is no guarantee that related rows from both tables will be included (see the first sketch after this list).
  • The TOP clause can be combined with TABLESAMPLE to cap the number of rows returned. We saw earlier how the TABLESAMPLE algorithm can return more rows than requested. However, there is still every chance that fewer rows will be returned than were specified in the TABLESAMPLE (see the second sketch after this list).
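
To see the join caveat in action, sample both sides of a join independently. This is only a sketch (the 10 percent figures are arbitrary); the count swings wildly from run to run, and most sampled order headers will be 'missing' their detail rows:

-- Each table is sampled independently, so matching rows rarely line up.
select count(*) as matchedRows
from Sales.SalesOrderHeader h tablesample system (10 percent)
join Sales.SalesOrderDetail d tablesample system (10 percent)
    on d.SalesOrderId = h.SalesOrderId;

And to cap the sample at an exact upper bound, combine TOP with TABLESAMPLE (again a sketch; you may still get fewer than 10 rows if the sampled pages come up short):

-- TOP trims an over-sized sample; it cannot top up an under-sized one.
select top (10) SalesOrderId, OrderDate, TotalDue
from Sales.SalesOrderHeader tablesample system (1000 rows);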

The REPEATABLE option of the TABLESAMPLE clause basically causes SQL Server to re-use the same random numbers when evaluating each of the table's pages (SQL Server is probably using a linear congruential generator seeded with the supplied value). The net effect is that the same pages will be sampled, assuming no underlying data has changed.
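
For example, the following returns the same sample every time it runs against unchanged data (the seed value 42 is arbitrary; any constant will do):

-- Same repeat_seed, same pages, same rows (until the data changes).
select SalesOrderId, OrderDate, TotalDue
from Sales.SalesOrderHeader tablesample system (1000 rows) repeatable (42);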

In terms of practical applications of the TABLESAMPLE clause, I really struggled to think of a situation where it would be useful. At face value, all that TABLESAMPLE does is return a certain number or percentage of 'random-ish' rows from the specified table. Unless you have some highly-specialised requirement to get a 'flavour' of the data in a table, this new clause isn't that useful. It doesn't augment the TOP / ROWCOUNT clauses at all and doesn't expose any customisable or extensible behaviour.

I strongly suspect that the TABLESAMPLE clause was added specifically for the optimiser's gathering of statistics. Anyone who has seen STATMAN scrolling past in a SQL Profiler trace will know that updating statistics uses a number of undocumented T-SQL operators and index hints. For example:

select SalesOrderId, OrderDate, TotalDue
from (
    select top 100 percent SalesOrderId, OrderDate, TotalDue
    from Sales.SalesOrderHeader with (readcommitted, sample 10e-2 percent)
    order by 1, 2, 3) as sampled
option (bypass optimizer_queue, maxdop 1);

I count at least two undocumented operators and one view-ordering workaround being used in this query. TABLESAMPLE is definitely a much cleaner implementation of a query that appears to be performing the same function.

Who knows, maybe one day your boss will ask your team: "You know what, it would be really cool if we could return random pages from this table, according to a probability we specify, get back different results every time we execute it, and even return vastly different numbers of rows than what we asked for". You'll be the first one to raise your hand...
