## Monday, 20 November 2006

# The TABLESAMPLE Clause

posted by Stefan Delmarco

When I first came across TABLESAMPLE I had visions of some advanced index hints allowing fine-grained control over table statistics (distribution, ranges, density, etc.). Unfortunately not. As per Occam's Razor, the simplest explanation is the correct one. It just... returns a **sample** from the specified **table**...?

The complete syntax for TABLESAMPLE is as follows (from BOL):

TABLESAMPLE [SYSTEM] (sample_number [ PERCENT | ROWS ]) [REPEATABLE (repeat_seed)]

Given this syntax, the most obvious query to test drive TABLESAMPLE would be to return, say, 10 rows from the Sales.SalesOrderHeader table. After all, this syntax looks a lot like TOP:

select SalesOrderId, OrderDate, TotalDue
from Sales.SalesOrderHeader tablesample system (10 rows);

SalesOrderId | OrderDate | TotalDue |
------------ | ----------------------- | --------------------- |

**(0 row(s) affected)**

How unexpected is that! Zero rows returned? To add to the confusion, change the query to return 1,000 rows as follows:

select SalesOrderId, OrderDate, TotalDue
from Sales.SalesOrderHeader tablesample system (1000 rows);

Then run the query a few times and observe how many rows get returned:

SalesOrderId | OrderDate | TotalDue |
------------ | ----------------------- | --------------------- |
44137 | 2001-09-01 00:00:00.000 | 3953.9884 |
... |

**(938 row(s) affected)**

SalesOrderId | OrderDate | TotalDue |
------------ | ----------------------- | --------------------- |
44407 | 2001-10-16 00:00:00.000 | 3729.364 |
... |

**(1263 row(s) affected)**

SalesOrderId | OrderDate | TotalDue |
------------ | ----------------------- | --------------------- |
44757 | 2001-12-01 00:00:00.000 | 49953.7086 |
... |

**(1085 row(s) affected)**

Each execution returns a **different** number of rows! Some of them just below 1,000 and others just above. Clearly there is a non-deterministic factor at play here. Without any data changing, re-running the identical query keeps giving different results. Digging into the BOL documentation shows that the TABLESAMPLE algorithm has a very strong statistical slant to it. It works as follows:

For the table affected by the TABLESAMPLE, SQL Server needs to determine the probability that any given page of that table's data will be sampled. If a PERCENT value was specified, that percentage is the probability. Alternatively, if a number of ROWS was specified, that number is divided by the total number of rows in the table to calculate the probability. In our case, the Sales.SalesOrderHeader table has 31,465 rows and we asked for 1,000 rows in the second query, so the probability works out as 1,000 / 31,465 ≈ 3.2%.
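As a quick sanity check on that arithmetic (a Python stand-in for the calculation, using the row count quoted above):

```python
# Page-sampling probability = requested rows / total rows in the table.
total_rows = 31465                # rows in Sales.SalesOrderHeader (AdventureWorks)

p_1000_rows = 1000 / total_rows   # TABLESAMPLE SYSTEM (1000 ROWS)
p_10_rows = 10 / total_rows       # TABLESAMPLE SYSTEM (10 ROWS)

print(f"{p_1000_rows:.2%}")       # 3.18%, the ~3.2% quoted above
print(f"{p_10_rows:.3%}")         # 0.032%
```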

Once this probability has been calculated, SQL Server will evaluate it for every page of the table. The details of the algorithm aren't documented but it is along the lines of:

- For each page, a random number between 0 and 1 is generated (e.g. RAND()).
- If the random number is greater than the probability, SQL Server will pretend that the page for the table does not exist (i.e. it is not sampled / 'out-of-scope').
- Otherwise, SQL Server considers **all** of the rows on the page sampled / 'in scope'.
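The three steps above are easy to sketch as a simulation. To be clear, this is only my reading of the undocumented algorithm, with a made-up layout of 45 rows per page; the point it illustrates is that whole pages are in or out, so the row count scatters around the target:

```python
import random

def simulate_tablesample(total_rows, rows_per_page, requested_rows, rng):
    # Probability that any given page is sampled (the ROWS form).
    p = requested_rows / total_rows
    pages = total_rows // rows_per_page   # ignore the partial last page
    sampled_rows = 0
    for _ in range(pages):
        # One random number per page; the page is 'in scope' if it falls
        # within the probability...
        if rng.random() <= p:
            # ...and then ALL of the page's rows are returned.
            sampled_rows += rows_per_page
    return sampled_rows

rng = random.Random(2006)
counts = [simulate_tablesample(31465, 45, 1000, rng) for _ in range(5)]
print(counts)   # five counts, each a multiple of 45, scattered around 1,000
```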

This is very revealing as it explains the behaviour we saw with the first query. We asked for a sample of 10 rows. SQL Server therefore calculated the probability of a page being returned as:

10 / 31,465 = 0.032%, a very small number. On a single execution there is a good chance that no page's random number falls within this small probability, so nothing comes back. However, we can increase our chances of getting at least one page returned by executing the select statement a number of times:

declare @loopCount int;
set @loopCount = 0;
declare @rowCount int;
set @rowCount = 0;

-- Keep looping till we manage to sample at least one page.
while @rowCount = 0
begin
    select SalesOrderId, OrderDate, TotalDue
    from Sales.SalesOrderHeader tablesample system (10 rows);

    set @rowCount = @@rowcount;
    set @loopCount = @loopCount + 1;
end;

raiserror('Required %d loops to return %d rows', 0, 1, @loopCount, @rowCount);

SalesOrderId | OrderDate | TotalDue |
------------ | ----------------------- | --------------------- |

**(0 row(s) affected)**

SalesOrderId | OrderDate | TotalDue |
------------ | ----------------------- | --------------------- |

**(0 row(s) affected)**

SalesOrderId | OrderDate | TotalDue |
------------ | ----------------------- | --------------------- |

**(0 row(s) affected)**

SalesOrderId | OrderDate | TotalDue |
------------ | ----------------------- | --------------------- |

**(0 row(s) affected)**

SalesOrderId | OrderDate | TotalDue |
------------ | ----------------------- | --------------------- |
69998 | 2004-05-06 00:00:00.000 | 2649.8453 |
... |

**(46 row(s) affected)**

**Required 5 loops to return 46 rows**

Interestingly enough, when we finally get a probability hit we get 46 rows back instead of 10. This reveals another facet of the TABLESAMPLE algorithm: the probability calculation only decides whether a page is sampled. Once SQL Server has decided that a page of the table is to be included in the sample, it does not limit the number of rows sampled from that page. You are therefore always liable to get back roughly the page size divided by the row width in rows (assuming, of course, that the pages are densely packed). Running the above query repeatedly returns 40-45 rows whenever it hits, indicating that a page of the Sales.SalesOrderHeader table typically holds 40-45 rows.
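The 'required 5 loops' result is in line with what the probabilities predict. Assuming roughly 45 rows per page, the table has about 700 pages, each kept with probability 0.032%, so most executions return nothing (a back-of-the-envelope sketch under those assumed figures):

```python
pages = 31465 // 45        # ~700 pages, assuming ~45 rows per page
p_page = 10 / 31465        # per-page sampling probability, ~0.032%

# Probability that a single execution samples no page at all:
p_empty = (1 - p_page) ** pages
print(f"{p_empty:.0%}")    # ~80% of executions come back empty

# Expected number of executions until at least one page is hit
# (mean of a geometric distribution):
expected_loops = 1 / (1 - p_empty)
print(round(expected_loops, 1))   # ~5, matching the 5 loops observed above
```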

A couple of additional points to note:

- The TABLESAMPLE is applied before any WHERE clauses are evaluated.
- This includes any JOIN conditions as well. Misleading results can be obtained if two JOIN'd tables each have a TABLESAMPLE applied to them: since the sampling of the two tables is independent and uncorrelated, there is no guarantee that related rows from both tables will be included.
- The TOP clause can be combined with TABLESAMPLE to cap the number of rows returned. We saw earlier how the TABLESAMPLE algorithm can return more rows than requested. However, there is still every chance that fewer rows will be returned than were specified in the TABLESAMPLE.
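The join caveat is worth demonstrating. In a toy Python simulation (a stand-in for two JOIN'd tables, not SQL Server itself), sample the same 10,000 keys independently at 10% for each 'table' and see how few survive the join:

```python
import random

rng = random.Random(42)
keys = range(10000)

# Each 'table' independently keeps ~10% of its rows.
sample_a = {k for k in keys if rng.random() < 0.10}
sample_b = {k for k in keys if rng.random() < 0.10}

# The join only sees keys that happened to survive BOTH samples:
matched = sample_a & sample_b
print(len(sample_a), len(sample_b), len(matched))
# Each sample holds ~1,000 keys, but only ~1% of all keys (~100) match.
```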

The REPEATABLE option of the TABLESAMPLE clause basically causes SQL Server to use the same random numbers when evaluating each of the table's pages again (SQL Server is probably using a linear congruential generator to obtain the random numbers with the same seed). The net effect is that the same pages will be sampled assuming no underlying data has been changed.
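The principle is easy to mimic with any seeded generator (a Python sketch of the idea only; SQL Server's actual RNG and how REPEATABLE seeds it are not documented):

```python
import random

def sampled_pages(seed, n_pages=700, p=0.1):
    """Return the pages a given seed selects: same seed, same pages,
    mirroring TABLESAMPLE ... REPEATABLE (seed)."""
    rng = random.Random(seed)
    return [page for page in range(n_pages) if rng.random() <= p]

# Same seed: the identical set of pages is sampled every time.
assert sampled_pages(42) == sampled_pages(42)
# A different seed (almost certainly) samples a different set of pages.
assert sampled_pages(43) != sampled_pages(42)
print(len(sampled_pages(42)))   # ~70 of the 700 pages
```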

In terms of practical applications of the TABLESAMPLE algorithm, I really struggled to think of a situation where it would have been useful. At face value, all that TABLESAMPLE does is return a certain number or percentage of 'random-ish' rows from the specified table. Unless you have some highly-specialised requirement to get a 'flavour' for the data in the table, this new clause isn't that useful. It doesn't augment the TOP / ROWCOUNT clause at all and doesn't expose any customisable or extensible behaviour.

I strongly suspect that the TABLESAMPLE clause was added specifically for the optimiser's gathering of statistics. Anyone who has seen STATMAN scrolling past in a SQL Profiler trace will know that the update of statistics uses a number of undocumented TSQL operators and index hints. For example:

from (
    select top 100 percent SalesOrderId, OrderDate, TotalDue
    from Sales.SalesOrderHeader with (readcommitted, sample 10e-2 percent)
    order by 1, 2, 3)
option (bypass optimizer_queue, maxdop 1);

I count at least 2 undocumented operators and 1 view ordering workaround being used in this query. TABLESAMPLE is definitely a much cleaner implementation for a query that appears to be performing the same function.

Who knows, maybe one day your boss will ask your team: *"You know what, it would be really cool if we could return random pages from this table, according to a probability we specify, get back different results every time we execute it, and even return vastly different numbers of rows than what we asked for"*. You'll be the first one to raise your hand...