Fotia Ltd

Fotia Ltd

Monday, 23 April 2007

The NEWSEQUENTIALID Function

posted by Stefan Delmarco

The NEWSEQUENTIALID system function is an addition to SQL Server 2005. It seeks to bring together, what used to be, conflicting requirements in SQL Server 2000; namely identity-level insert performance and globally unique values.

Insert Performance and Uniqueness

The best insert performance will always be achieved by using an identity column as the clustering key. With a monotonically increasing identity column as the clustered index, data is always inserted at the end of the table. At a basic level this is equivalent to the performance of a table that has no clustered index (i.e. a heap) where the insert position for the next row is the page where the last row was inserted. Secondly, there will be no page splits. With a normal clustered index, SQL Server will move half the rows on a full page to a newly allocated page in order to make space for the new row. With an identity column on a clustered index SQL Server will not 'split' the page in this manner. It will just allocate another page and continue inserting rows. It has enough smarts to know not to copy half the rows to the newly allocated page as all new rows will be inserted into the new page.

The limitations of identity columns come to the fore when you need have unique keys that span either tables, databases or SQL Servers. For example, say you have multiple branches that can each generate orders. Each branch has a replicated copy of the Orders table, however you don't want duplicate order numbers generated across branches. Common solutions to this problem include creating a composite order number that includes the branch id, reserving ranges of the identity column or making use of a central 'number fountain'. Each of these approaches introduce a level of indirection that must be adhered to before a row can be created. If the order number is not required to be human-readable friendly, GUIDs are often used instead.

GUIDs

GUIDs (Globally Unique Identifiers) are pseudo-random 16-byte (128-bit) numbers whose generation algorithm provides a sufficiently high degree of probability that the same GUID will never be generated twice on any computer at any time. Although a GUID is a binary array, it is often represented in its hexadecimal form using the following format, for example: dbbc2827-edf8-4a2d-92ad-c1e0059304d7. This makes them suitable for the 'distributed uniqueness requirement' of our Orders table.

The problem would be pretty much solved there unless we have some non-trivial performance requirements, especially around insert volumes and speed. As GUIDs are effectively pseudo-random (the degree of randomness / predictability was changed by Microsoft in response to privacy concerns) they do not possess the same orderly insert benefits as identity columns when used as the clustering key. The insert IO pattern changes from sequential to random-access when using GUIDs instead of an IDENTITY column. Not only do GUIDs require far more IO to locate the insert location in the clustered index (the index has to be traversed from the root for every insert), fragmentation at the leaf level leads to less dense pages, which in turn requires defragmentation / reindexing.

NEWSEQUENTIALID()

Enter NEWSEQUENTIALID(), stage left. This is a new system function included in SQL Server 2005 that combines the best of both worlds. NEWSEQUENTIALID() will generate a GUID that will be greater in value than the previously generated one.

It is easy to see how this function works by looking at the output the following TSQL script generates:

create table TestTable (
   id uniqueidentifier default newsequentialid() not null primary key clustered,
   sequence int);
go

-- Insert 100 rows.
declare @count int;
set @count = 0;
while @count < 100 begin
   insert TestTable (sequence)
   values (@count);

   set @count = @count + 1;
end;
go

select *
from TestTable
order by id;
go

id sequence
------------------------------------ -----------
FA780E3B-03C2-DB11-BD9F-0011D82F3F23 0
FB780E3B-03C2-DB11-BD9F-0011D82F3F23 1
FC780E3B-03C2-DB11-BD9F-0011D82F3F23 2
FD780E3B-03C2-DB11-BD9F-0011D82F3F23 3
FE780E3B-03C2-DB11-BD9F-0011D82F3F23 4
...  
5C790E3B-03C2-DB11-BD9F-0011D82F3F23 98
5D790E3B-03C2-DB11-BD9F-0011D82F3F23 99

You'll notice that there is a clear bit-shifting pattern that SQL Server is employing to keep the GUIDs increasing in relative value. These are the restrictions in NEWSEQUENTIALID's use:

  • NEWSEQUENTIALID cannot be used in arbitrary TSQL statements. It can only be specified as the DEFAULT value for an uniqueidentifier column
  • More than one column in a table can use NEWSEQUENTIALID
  • It cannot be combined with other scalar functions e.g. REVERSE(NEWSEQUENTIALID()) is not allowed

The network card's MAC address is easily identifiable in the generated GUID. My laptop's MAC address is

   00-01-4A-28-64-8B

This is one of the NEWSEQUENTIALID GUIDs that was generated:

   4EAC7708-30C3-DB11-B902-00014A28648B

Hence the privacy warnings in BOL:

NEWSEQUENTIALID() Privacy Warning

Correlating the position of the MAC address in the GUID with the bits that get incremented / shifted for each successive GUID sheds light on why the GUIDS remain globally unique. The MAC address portion of the GUID does not change.

Performance Comparison

The following test demonstrates the benefits of NEWSEQUENTIALID. The insert performance of the various clustering key variations was measured and compared. For each test the same TestTable was recreated with a different Id column definition:

  • IDENTITY(,)
  • NEWID()
  • NEWSEQUENTIALID()

The following TSQL contains the DDL used to construct the tables for the 3 clustered keys:

-- IDENTITY
create table TestTable (
   id int identity(1,1) not null primary key clustered,
   sequence int not null,
   data char(250) not null default '');
go

-- NEWID
create table TestTable (
   id uniqueidentifier default newid() not null primary key clustered,
   sequence int not null,
   data char(250) not null default '');
go

-- NEWSEQUENTIALID
create table TestTable (
   id uniqueidentifier default newsequentialid() not null primary key clustered,
   sequence int not null,
   data char(250) not null default '');
go

For each test 50,000 rows were inserted into TestTable table using the following TSQL script:

-- Insert 50,000 rows.
declare @count int;
set @count = 0;
while @count < 50000 begin
   insert TestTable (sequence)
   values (@count);

   set @count = @count + 1;
end;
go

The following metrics were gathered after each run:

-- Get the number of read / writes for this session...
select reads, writes
from sys.dm_exec_sessions
where session_id = @@spid;

-- Get the page fragmentation and density at the leaf level.
select index_type_desc, index_depth, page_count, avg_page_space_used_in_percent, avg_fragmentation_in_percent, record_count
from sys.dm_db_index_physical_stats(db_id(), object_id('TestTable'), null, null, 'detailed')
where index_level = 0;
go

The results are quite compelling:

  Reads Writes Leaf Pages Avg Page Used Avg Fragmentation Record Count
IDENTITY(,) 0 1,683 1,667 98.9% 0.7% 50,000
NEWID() 0 5,386 2,486 69.3% 99.2% 50,000
NEWSEQUENTIALID() 0 1,746 1,725 99.9% 1.0% 50,000

Most striking is the number of writes required by the NEWID system function. This, coupled with the average page density of 69%, is evidence of the page splitting caused by the random distribution of inserts at the leaf level. As soon as a page fills up, it needs to be split into 2 pages of 50% each for the insert to complete. Not only has page splitting resulted in poor page density, it has fragmented the data pages quite badly (there is a 99% probability that the next data page is not next to the current one). In our tests the most likely place for a free page required for the page split is at the end of the table irrespective of where the row is being inserted. Therefore to read the rows in order the scan needs to keep jumping back and forth between widely distributed split pages, hence the appalling fragmentation.

The minor difference in page count between IDENTITY and NEWSEQUENTIALID metrics is due to the difference in size between IDENTITY's INT (4 bytes) and NEWSEQUENTIALID's UNQUEIDENTIFIER (16 bytes). However, note that any non-clustered index on the UNIQUEIDENTIFIER table will be 4 times larger than the same index on the IDENTITY table. Therefore IDENTITY is still preferred over NEWSEQUENTIALID when choosing between the two.

The NEWSEQUENTIALID system function has clearly lived up to its claim of providing GUID-like uniqueness coupled with identity-level insert performance. The number of writes, fragmentation and page density are all inline with identity-level metrics. These benefits make NEWSEQUENTIALID a compelling feature.

Labels: ,

Monday, 20 November 2006

The TABLESAMPLE Clause

posted by Stefan Delmarco

When I first same across TABLESAMPLE I had visions of some advanced index hints that allowed fine grained control over table statistics (distribution, ranges, density, etc.). Unfortunately not. As per Occam's Razor the simplest explanation is the correct one. It just... returns a sample from the specified table...?

The complete syntax for TABLESAMPLE is as follows (from BOL):

TABLESAMPLE

Given this syntax, the most obvious query to test drive TABLESAMPLE would be to return, say, 10 rows from the Sales.SalesOrderHeader table. After all, this syntax looks a lot like TOP:

select SalesOrderId, OrderDate, TotalDue
from Sales.SalesOrderHeader tablesample system (10 rows);

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------

(0 rows(s) affected)

How unexpected is that! Zero rows returned?! Huh? For the sake of adding to the confusion, change the query to return 1,000 rows as follows:

select SalesOrderId, OrderDate, TotalDue
from Sales.SalesOrderHeader tablesample system (1000 rows);

Then run the query a few times and observe how many rows get returned:

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------
44137 2001-09-01 00:00:00.000 3953.9884
...

(938 rows(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------
44407 2001-10-16 00:00:00.000 3729.364
...

(1263 rows(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------
44757 2001-12-01 00:00:00.000 49953.7086
...

(1085 rows(s) affected)

Each execution returns a different number of rows! Some of them just below 1,000 and others just above. Clearly there is a non-deterministic factor at play here. Without any data changing, re-running the identical query keeps giving different results. Digging into the BOL documentation shows that the TABLESAMPLE algorithm has a very strong statistical slant to it. It works as follows:

For the table affected by the TABLESAMPLE, SQL Server needs to determine a probability that a page of that table's data will be sampled. This probability is equal to the PERCENTAGE specified in the TABLESAMPLE. Alternatively, if a number of ROWS was specified, that number is divided by the total number of rows in the table to calculate the probability. In our case, the Sales.SalesOrderHeader table has 31,465 rows. We asked for 1,000 rows in the second query. Therefore the probability is calculated as 3.2%.

Once this probability has been calculated, SQL Server will evaluate it for every page of the table. The details of the algorithm aren't documented but it is along the lines of:

  • For each page, a random number between 0 and 1 is generated (e.g. RAND()).
  • If the random number is greater than the probability, SQL Server will pretend that the page for the table does not exist (i.e. it is not sampled / 'out-of-scope').
  • Otherwise, SQL Server considers all of the rows on the page sampled / 'in scope'.

This is very revealing as it explains the behaviour we saw with the first query. We asked for a sample of 10 rows. SQL Server therefore calculated the probability of a page being returned as:
10 / 31,465 = 0.032 %, a very small number. If we execute the query once, there is a chance SQL Server will not generate a random number that falls within this small probability. However, we can increase our chances of getting at least one page returned by executing the select statement a number of times:

declare @loopCount int;
set @loopCount = 0;

declare @rowcount int;
set @rowCount = 0;

-- Keep looping till we manage to sample at least one page.
while @rowCount = 0 begin
    select SalesOrderId, OrderDate, TotalDue
    from Sales.SalesOrderHeader tablesample system (10 rows);

    set @rowcount = @@rowcount;
    set @loopCount = @loopCount + 1;
end;

raiserror('Required %d loops to return %d rows', 0, 1, @loopCount, @rowCount);

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------

(0 rows(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------

(0 rows(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------

(0 rows(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------

(0 rows(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------
69998 2004-05-06 00:00:00.000 2649.8453
...

(46 rows(s) affected)

Required 5 loops to return 46 rows

Interestingly enough, when we get back some rows (i.e. we get a probability hit) we get 91 rows returned instead of 10. This reveals another facet of the TABLESAMPLE algorithm. It only goes as far as the probability calculation of a page for sampling data. Once it has decided that a page of the table is to be included in the sample, it will not limit the number of rows sampled from that page. Therefore you are always liable to get back the number of rows equivalent to the page size divided by the row width (assuming, of course, that the pages are densely packed). Running the above query repeatedly returns 40-45 rows, indicating that a page of the Sales.SalesOrderHeader table typically holds 40-45 rows per page.

A couple of additional points to note:

  • The TABLESAMPLE is applied before any WHERE clauses are evaluated.
  • This includes any JOIN conditions as well. Some misleading results can be obtained if two tables that are JOIN'd have a TABLESAMPLE applied to them. As the sampling of the tables are independent and uncorrelated, there is no guarantee that related rows from both tables will be included.
  • The TOP clause can be combined with TABLESAMPLE to cap the number of rows returned. We saw earlier how the TABLESAMPLE algorithm can return more rows than requested. However, there is still every chance that less rows will be returned than was specified in the TABLESAMPLE.

The REPEATABLE option of the TABLESAMPLE clause basically causes SQL Server to use the same random numbers when evaluating each of the table's pages again (SQL Server is probably using a linear congruential generator to obtain the random numbers with the same seed). The net effect is that the same pages will be sampled assuming no underlying data has been changed.

In terms of practical applications of the TABLESAMPLE algorithm, I really struggled to think of a situation where it would have been useful. At face value, all that TABLESAMPLE does is return a certain number or percentage of 'random-ish' rows from the specified table. Unless you have some highly-specialised requirement to get a 'flavour' for the data in the table, this new clause isn't that useful. It doesn't augment the TOP / ROWCOUNT clause at all and doesn't expose any customisable or extensible behaviour.

I strongly suspect that the TABLESAMPLE clause was added specifically for the optimiser's gathering of statistics. Anyone who has seen STATMAN scrolling past in a SQL Profiler trace will know that the update of statistics uses a number of undocumented TSQL operators and index hints. For example:

select SalesOrderId, OrderDate, TotalDue
from (
    select top 100 percent SalesOrderId, OrderDate, TotalDue
    from Sales.SalesOrderHeader with (readcommitted, sample 10e-2 percent)
    order by 1, 2, 3)
option (bypass optimizer_queue, maxdop 1);

I count at least 2 undocumented operators and 1 view ordering workaround being used in this query. TABLESAMPLE is definitely a much cleaner implementation for a query that appears to be performing the same function.

Who knows, maybe one day your boss will ask your team: "You know what, it would be really cool if we could return random pages from this table, according to a probability we specify, get back different results every time we execute it, and even return vastly different numbers of rows than what we asked for". You'll be the first one to raise your hand...

Labels: ,

Friday, 13 October 2006

The UNPIVOT Operator

posted by Stefan Delmarco

Completing the discussion on the PIVOT operator is its antonym, the UNPIVOT. However, I would argue that the UNPIVOT operator has been misnamed by Microsoft as it creates the mistaken impression that PIVOT -> UNPIVOT -> PIVOT gets you back to where you started. From BOL:

UNPIVOT performs the opposite operation to PIVOT by rotating columns of a table-valued expression into column values.

It fails to highlight that fact that PIVOT has aggregated the raw data in order to rotate the underlying rows into columns - you have to read to the very end of the Using PIVOT and UNPIVOT sample for the fine print in BOL. In contrast, the UNPIVOT operator does not perform any aggregation whatsoever. It just turns columns into rows. For each column in the table being UNPIVOT'd you get a row in the result set where the row values are ColumnName and ColumnValue. It therefore helps to think of transpose instead of UNPIVOT as this is more accurate description of the UNPIVOT operator's behaviour.

Nevertheless, let's have a look at the syntax to try and get a better feel for what UNPIVOT brings to the table. As a worked example, consider the following requirement: Write a query to produce a result set that contains a list of products (Production.Product) and their SellStartDate and SellEndDate. The SellStartDate and SellEndDate need to be listed in a single column with a second column specifying whether it is the SellStartDate or SellEndDate value being specified. Lastly, the result set needs to be sorted by this consolidated date column to produce a linear timeline of 'product events'.

In SQL Server 2000, you'd probably end up with a query that looks very much like the following:

select ProductId, [Name], 'SellStartDate' as DateType, SellStartDate as DateValue
from Production.Product
union all
select ProductId, [Name], 'SellEndDate' as DateType, SellEndDate as DateValue
from Production.Product
where SellEndDate is not null
order by DateValue;

ProductId Name DateType DateValue
----------- ----------- ----------- -----------
...
320 Chainring Bolts SellStartDate 1998-06-01 00:00:00.000
321 Chainring Nut SellStartDate 1998-06-01 00:00:00.000
322 Chainring SellStartDate 1998-06-01 00:00:00.000
323 Crown Race SellStartDate 1998-06-01 00:00:00.000
324 Chain Stays SellStartDate 1998-06-01 00:00:00.000
325 Decal 1 SellStartDate 1998-06-01 00:00:00.000
326 Decal 2 SellStartDate 1998-06-01 00:00:00.000
327 Down Tube SellStartDate 1998-06-01 00:00:00.000
328 Mountain End Caps SellStartDate 1998-06-01 00:00:00.000
329 Road End Caps SellStartDate 1998-06-01 00:00:00.000
330 Touring End Caps SellStartDate 1998-06-01 00:00:00.000
...

Using the UNPIVOT operator, in SQL Server 2005, the equivalent is as follows:

select ProductId, [Name], DateType, DateValue
from
    (select ProductID, [Name], SellStartDate, SellEndDate
    from Production.Product) as rawData
unpivot
    (DateValue for DateType in
       (SellStartDate, SellEndDate)) as transposed
order by DateValue;

The key to understanding the UNPIVOT syntax is to see how the DateValue and DateType columns are specified. When transposing the columns into rows, they effectively become name / value pairs. The name is the name of the column and value is the value of the column for the relative row. BOL calls it value_column and pivot_column but just think of them as name / value pairs.

The first portion of the query selects the raw data which we will be transposing. The UNPIVOT portion of the query requires the name of the columns to be used for the name and value pair columns (Value FOR Name) followed by the columns that will be transposed. Lastly, the outermost select allows you specify which transposed columns you want to return, so you can choose not to return the Name column if you don't want it.

As the SELECT...CASE statements were able to duplicate all the functionality of the PIVOT operator, SELECT...UNION ALL can perform all the same tricks as UNPIVOT. However, the advantage again is a more compact and expressive syntax.

One behaviour of UNPIVOT to watch out for is that it will not transpose column values that are NULL. NULL values will not appear as name / value pairs in the results. In the sample above I've compensated for this behaviour by including where SellEndDate is not null in the SQL Server 2000 equivalent query. Note that there is way to change this behaviour.

In terms of performance, the query optimiser highlights a significant difference between the SELECT...UNION ALL and UNPIVOT queries. The former requires a pass over the underlying table for each select (2 passes in our sample) whereas the UNPIVOT requires only a single pass. SQL Server is able to transpose the columns into rows for each column in a single pass over the underlying data. This makes the UNPIVOT query less expensive than the equivalent UNION ALL. The size of this penalty is proportional to the number of rows being processed. This fact along makes UNPIVOT a very useful tool to keep note of in your TSQL arsenal.

Labels: ,

Friday, 22 September 2006

The PIVOT Operator

posted by Stefan Delmarco

SQL Server 2005 introduces the PIVOT operator to the TSQL stable. The intention of this operator is to make it easier to transform unique values in rows into columns and aggregate the resulting data. Anyone familiar with Excel pivot tables or Access cross-tab queries will immediately be able to visualise what pivot tables look like. Prior to SQL Server 2005, the only way to transform data, in the manner provided by PIVOT, was to use a number of SELECT...CASE statements. As we'll see, the PIVOT operator has not given us functionality that was not possible before. However, it has made pivoting data much easier to perform and far more readable.

In order to understand how to form a PIVOT query, it is important to be able to describe, in plain English, what the query is trying to accomplish. Personally I have found that thinking of the PIVOT result set in terms of a graph (i.e. a X- and Y- axis) makes it easier to formulate into TSQL.

As a worked example, consider the following requirement. You need to generate a report that will correlate the number of addresses that were changed per city for each year. In other words, formulate a result set that will have each city listed as a column (X-axis). For each city, count the number of times an address in that city was modified in a given year (Y-axis). In SQL Server 2000, you'd probably create some TSQL similar to the following:

select datepart(yyyy, modifiedDate) as [WhenUpdates],
    sum(case city when 'Monroe' then 1 else 0 end) as [Monroe],
    sum(case city when 'Redmond' then 1 else 0 end) as [Redmond],
    sum(case city when 'Snohomish' then 1 else 0 end) as [Snohomish],
    sum(case city when 'Issaquah' then 1 else 0 end) as [Issaquah]
from Person.Address
group by datepart(yyyy, modifiedDate)
order by 1;

WhenUpdates Monroe Redmond Snohomish Issaquah
----------- ----------- ----------- ----------- -----------
1996 1 0 0 0
1997 0 1 0 0
1998 2 2 1 1
1999 11 12 7 13
2000 0 3 2 0
2001 0 5 0 6
2002 1 9 0 13
2003 0 45 0 42
2004 1 44 0 45

A couple of things to note:

  • Even though there are many more cities in the Person.Address table than the ones specified, we must 'hardcode' the city names into the query. There is no way to dynamically create additional columns if a new city is encountered. That fact alone makes this query difficult to maintain. If another city is added to Person.Address, this query would have to be modified.
  • The Y-axis (WhenUpdated) column is dynamic, we do not require prior knowledge of the modifiedDate's content when formulating the query. When an address is modified in a new year (e.g. 2005), another row will automatically be added to this result set without any changes.

Now, for the PIVOT operator. There is good news and bad news. The good news is that the syntax is far more compact and readable. The bad news is that it overcomes none of the limitations mentioned above! However, as we'll see, the good news is still worthwhile...

A PIVOT query is split into a number of sections. First, you need to write the SELECT statement that provides you with the raw, non-pivoted data. In our example, the raw data would be the following query:

select datepart(yyyy, modifiedDate) as [WhenUpdated], AddressID, City
from Person.Address;

WhenUpdated AddressID City
----------- ----------- ------------------------------
...
2004 19 Bothell
1999 20 Bothell
1999 21 Bothell
2001 22 Portland
1999 23 Seattle
2001 24 Duluth
2002 25 Dallas
2001 26 San Francisco
...

WhenUpdated is the Y-axis, AddressID is a data point - we're going to count AddressIDs for a given (X,Y) / (City,WhenUpdated) combination - and City is the X-axis. Once we have the raw data, it is simple case of articulating which column is used for what function in the pivot:

select Pvt.*
from (
    -- The raw data as a derived table...
    select datepart(yyyy, modifiedDate) as [WhenUpdated], AddressID, City
    from Person.Address) as RawData
pivot (
    count(AddressID)  -- How each X,Y point is calculated...
    for City          -- The X-axis, stated explicitly...
    in ([Monroe], [Redmond], [Snohomish], [Issaquah])) as Pvt
order by 1;

WhenUpdated Monroe Redmond Snohomish Issaquah
----------- ----------- ----------- ----------- -----------
1996 1 0 0 0
1997 0 1 0 0
1998 2 2 1 1
1999 11 12 7 13
2000 0 3 2 0
2001 0 5 0 6
2002 1 9 0 13
2003 0 45 0 42
2004 1 44 0 45

At first glance the PIVOT section seems a bit unintuitive. However, if you work through the syntax it will start to make sense. The PIVOT operator needs to know how to calculate each (X,Y) data point's value, specified as an aggregate function. The FOR column contains the X-axis values and the IN contains the hard-coded list of X-axis columns. As you can see, there is no getting around the SELECT...CASE limitation of having to know the transformed columns before-hand. In case you are wondering: no, IN cannot take a wildcard, e.g. IN(*). I believe that the limitation exists due to the optimizer's requirement to have fixed metadata when the execution plan is generated. This would not be possible if the column metadata is dynamically generated from the data itself.

The Y-axis is inferred as the distinct set of all columns that have not been used by the aggregate function or by the X-Axis. In our case, that would be the WhenUpdated computed column.

In terms of execution plans, the SELECT...CASE and PIVOT produce almost identical plans. However, as we're throwing away more of the rows in the Person.Address table (only 267 of the 19,614 rows are for the cities in our X-axis), both queries would be far more efficient if they included the set of X-axis column values (i.e. the names of the cities) in their WHERE clauses.

Instead of trying to dream up dynamic-SQL ways of working around this problem (non-dynamic columns and rows falling outside the predefined X-axis values) I would argue that you're better off aggregating the data in TSQL (using 'traditional' SELECT...GROUP BY statements) and transforming it in your middle tier / client application where there are better data structures that can be used to represent extremely wide or sparse result sets.

A better pattern to follow is to realise that the X-axis can be used to specify fixed ranges instead of raw values. Using this technique you can select ranges that will guarantee that you have covered off all possible X-axis values. For example, instead of allowing the raw unique values of a column to destabilise the PIVOT, the following query maps the values in the TotalDue column in a number of well-known buckets and then uses those values as the X-axis. Using this technique there is no possibility of having rows fall outside the X-axis range:

select Pvt.*
from (
    select datepart(yyyy, OrderDate) as [OrderDate],
        case
            when TotalDue < 100 then 'NoMargin'
            when TotalDue >= 100 and TotalDue > 1000 then 'OnTarget'
            else 'ExtraBonus' end as [Profit],
        PurchaseOrderId
    from Purchasing.PurchaseOrderHeader) as RawData
pivot (
    count(PurchaseOrderId)
    for Profit
    in([NoMargin], [OnTarget], [ExtraBonus])) as Pvt
order by 1;

OrderDate NoMargin OnTarget ExtraBonus
----------- ----------- ----------- -----------
2001 0 4 4
2002 7 105 160
2003 26 399 607
2004 68 1070 1550

As we've seen, the PIVOT operator does not provide any transformational capabilities that weren't possible in SQL Server 2000 (with liberal use of SELECT...CASE). However, as any experienced developer will know, the easier it is to express a reasonably complex SET-based problem, the better chance you have of getting it right, first time. Even more importantly, the poor developer required to maintain it (after you've moved on to greater conquests) will have a better chance of getting it right for the second time!

Labels: ,

Saturday, 26 August 2006

The APPLY Operator

posted by Stefan Delmarco

We've come across the APPLY operator before when we were exploring DMVs. The APPLY operator has been added to the TSQL repertoire in SQL Server 2005. It fills an important gap with User-Defined Functions (UDFs).

UDFs are less well-known than stored procedures (SPs) and are often overlooked by inexperienced SQL developers. BOL has a good write-up on how to choose between SPs and UDFs. For the context of this article, a brief overview of UDFs:

UDFs come in two flavours: those that returns scalars and those that return tables. Scalar-valued UDFs are very straightforward. They take in a bunch of parameters and return a single scalar value. These are functions in the traditional VB / C# / C++ sense. Table-valued functions are a little different. They also can take in a bunch of parameters. However, they are able to return a table instead of a scalar value. This returned table is referenced in TSQL in the same manner as a table or a view. This makes them look like a cross between functions and views. In fact, there are two types of table-valued UDFs. Those that contain a single SELECT statement (they look very much like a derived table) called 'Inline Table-Valued Functions' and those that build-up the return table using procedural logic (they look like stored procedures) called 'Multistatement Table-Values Functions'. It is the latter-type that we are interested in.

Multistatement table-valued functions are very powerful as they allow us to encapsulate complex procedural logic in a function when that logic cannot be implemented in a database view (i.e. you cannot express that desired output using a single set-based TSQL expression). You are free to construct the output rows in what ever manner you wish! The only restriction is that you need to declare the schema of the table that the UDF returns upfront as part of the UDF's declaration.

As an example, consider the following table-valued UDF:

create function dbo.udf_DecomposeDns (
    @webServiceUrl nvarchar(1024))
returns @dnsDomains table (
    domain nvarchar(1024))
as begin
    -- Return no rows if NULL...
    if @webServiceUrl is NULL
        return;

    declare @dns nvarchar(1024);

    -- Remove the contents after the DNS...
    declare @slashIndex int;
    set @slashIndex = charindex(N'/', @webServiceUrl);
    if @slashIndex > 0
        set @dns = left(@webServiceUrl, @slashIndex - 1);
    else
        set @dns = @webServiceUrl;

    -- Decompose the DNS into its hierarchical parts...
    declare @dotIndex int;
    set @dotIndex = charindex(N'.', @dns);
    insert @dnsDomains (domain) values (@dns);

    while @dotIndex > 0 begin
        set @dns = right(@dns, len(@dns) - @dotIndex);
        insert @dnsDomains (domain) values (@dns);
        set @dotIndex = charindex(N'.', @dns);
    end;
    return;
end;

This UDF will decompose a URL into its hierarchical DNS domain constituents. For the sake of brevity the URL is assumed to consist of a DNS domain name followed by a URL-path (as in AdventureWorks' Purchasing.Vendor PurchasingWebServiceURL column). For example:

select *
from dbo.udf_DecomposeDns(N'www.extranet.fotia.com/order.asmx')

domain
-------------------------
www.extranet.fotia.co.uk
extranet.fotia.co.uk
fotia.co.uk
co.uk
uk

So far, all this is well within the capabilities of SQL Server 2000. Now, consider the following requirement: Generate a report that will contain the DNS composition of each Vendor's PurchasingWebServiceURL in AdventureWorks' Purchasing.Vendor table.

To achieve this we want to decompose each URL into its domain hierarchy. For each vendor we would expect multiple rows to be returned, equivalent to the number of parts in that Vendor's URL. This would not be possible to do in a single SELECT statement in SQL Server 2000. The SELECT syntax did not have an operator / syntax that would allow a table-valued UDF to be re-evaluated repeatedly for every row it was being joined to. We could only execute a table-valued UDF once in a SELECT statement. We could pass a parameter to the UDF but the parameter value could not be from any column in the resulting set. It could only be a constant value or a declared variable.

This is the gap that the APPLY operator fills in SQL Server 2005. It allows us to get the table-valued UDF to be executed for each row it is being joined to. In this manner the parameters to a table-valued UDF should be seen as the join criteria.

The APPLY operator can be used in two forms: CROSS APPLY and OUTER APPLY. To understand the difference think of the table-valued UDF as a normal table you join to. You'd either use an INNER or an OUTER join. CROSS APPLY is an INNER JOIN, and OUTER APPLY is an OUTER JOIN. If the UDF returns an empty result set for a specific row, that row will be excluded if CROSS APPLY is used. However, if OUTER APPLY is used that row will still be included. NULL values will be substituted for the missing UDF row (identical to a traditional OUTER JOIN). It's that simple.

CROSS APPLY over the Purchasing.Vendor table produces the following result:

select [Name], PurchasingWebServiceURL, domain
from Purchasing.Vendor
    cross apply dbo.udf_DecomposeDns(PurchasingWebServiceURL);

Name PurchasingWebServiceURL domain
------------------- ------------------------------ --------------------------
A. Datum Corporation www.adatum.com/ www.adatum.com
A. Datum Corporation www.adatum.com/ adatum.com
A. Datum Corporation www.adatum.com/ com
Trey Research www.treyresearch.net/ www.treyresearch.net
Trey Research www.treyresearch.net/ treyresearch.net
Trey Research www.treyresearch.net/ net
Wide World Importers www.wideworldimporters.com/ www.wideworldimporters.com
...

Compare the output with OUTER APPLY. Note the NULL rows substituted for the missing rows from the UDF:

select [Name], PurchasingWebServiceURL, domain
from Purchasing.Vendor
    outer apply dbo.udf_DecomposeDns(PurchasingWebServiceURL);

Name PurchasingWebServiceURL domain
---------------------------------- ------------------------ --------------------
International NULL NULL
Electronic Bike Repair & Supplies NULL NULL
Premier Sport, Inc. NULL NULL
Comfort Road Bicycles NULL NULL
Metro Sport Equipment NULL NULL
Green Lake Bike Company NULL NULL
Mountain Works NULL NULL
Continental Pro Cycles NULL NULL
A. Datum Corporation www.adatum.com/ www.adatum.com
A. Datum Corporation www.adatum.com/ adatum.com
A. Datum Corporation www.adatum.com/ com
Trey Research www.treyresearch.net/ www.treyresearch.net
Trey Research www.treyresearch.net/ treyresearch.net
Trey Research www.treyresearch.net/ net
Anderson's Custom Bikes NULL NULL
Compete, Inc. NULL NULL
...

We are able to bring the full might of TSQL to bear on table-valued UDFs. For example, aggregations can be run on the result set:

select domain, count(*) as [CountOf]
from Purchasing.Vendor
    outer apply dbo.udf_DecomposeDns(PurchasingWebServiceURL)
group by domain
order by 2 desc;

domain CountOf
----------------------------- -----------
NULL 98
com 5
litwareinc.com 1
net 1
northwindtraders.com 1
proseware.com 1
treyresearch.net 1
...

The APPLY operator is indeed a valuable addition to the TSQL. It allows a whole new modular and procedural approach to composing result sets.

Labels: ,

Monday, 7 August 2006

Introducing Common Table Expressions (CTEs)

posted by Stefan Delmarco

The concept of CTEs has been around in SQL Server for a long time. Anyone who's used derived tables before will immediately recognise and understand CTEs. Think of CTEs as formalised sub-queries or temporary views that can be defined and referenced in a single TSQL statement.

I've labelled this article as an introduction as CTEs bring a lot to the TSQL table, which we'll explore in much greater detail in future articles.

Consider the following requirement: you need a result-set that returns the sum of all products sold per product. For each product, you'll need to return a bunch of metadata about that product. In addition, your boss (for reasons only known to him) want these results only for products that are either red in colour or have more than 10 transactions! In SQL Server 2000 you might start writing something similar to the following:

select [Name], Color, sum(Quantity) as [SumQuantity]
from Production.Product P
    inner join Production.TransactionHistory T
        on P.ProductId = T.ProductId
where P.Color = 'red'
    or (    select count(*)
            from Production.TransactionHistory THinner
            where THinner.ProductId = P.ProductID) >= 10
group by [Name], Color;

Whilst this query gets the job done, it does look a bit messy as we're having to perform a correlated sub-query for each ProductID, evaluate it as a scalar and then determine whether or not we're going to calculate the sum of all quantities of all transactions for the associated ProductID. In addition, this TSQL is misrepresenting its purpose in the GROUP BY clause. The requirement stated that the summation needs to be performed per ProductID. We probably got lucky here as Name is unique per Product. If it wasn't we could get some unpredictable results (e.g. if there were two Pink FooBar's that differed only by some other metadata). Any column in the SELECT list that is not an aggregate must be included in the GROUP BY. It is the mix of aggregate and non-aggregate data that has lead to this mistake. A correct way to write this query would be to use a derived table:

select P.[Name], P.Color, PA.SumQuantity
from Production.Product P
    inner join (
            select ProductID,
                count(*) as [CountOfTransactions],
                sum(Quantity) as [SumQuantity]
            from Production.TransactionHistory
            group by ProductID) as PA
        on P.ProductId = PA.ProductID
where PA.[CountOfTransactions] >= 10
 or P.Color = 'red';

This is a much better as we have separated out the aggregate from the additional metadata portion of the query. The inner select statement is a derived table that is uncorrelated with the outer select so that it can execute independently. We have given this derived table the PA alias. Whilst this has given the TSQL much better structure, it does have some limitations. If we wanted to refer to the derived table again we would have to copy-paste it again into the select. For example, if we now wanted to include number of transactions the modulo 10 of the ProductID contained (simulating an arbitrary Product-to-Product relationship), we'd have to copy-and-paste the whole derived table again:

select P.[Name], P.Color, PA1.SumQuantity, PA2.ProductID, PA2.SumQuantity
from Production.Product P
    inner join (
            select ProductID,
                count(*) as [CountOfTransactions],
                sum(Quantity) as [SumQuantity]
            from Production.TransactionHistory
            group by ProductID) as PA1
        on P.ProductId = PA1.ProductID
    inner join (
            select ProductID,
                count(*) as [CountOfTransactions],
                sum(Quantity) as [SumQuantity]
            from Production.TransactionHistory
            group by ProductID) as PA2
        on P.ProductId % 10 = PA2.ProductId
where PA1.[CountOfTransactions] >= 10
 or P.Color = 'red';

Now to introduce CTEs: Rewriting the above query to use a CTE will look as follows,

with ProductAggregate(ProductId, CountOfTransactions, SumQuantity)
as (
    select ProductID, count(*), sum(Quantity)
    from Production.TransactionHistory
    group by ProductID)
select P.[Name], P.Color, PA1.SumQuantity, PA2.ProductID, PA2.SumQuantity
from Production.Product P
    inner join ProductAggregate PA1
        on P.ProductId = PA1.ProductID
    inner join ProductAggregate PA2
        on P.ProductId % 10 = PA2.ProductId
where PA1.CountOfTransactions >= 10
or P.Color = 'red';

As a developer, think of a CTE as a local variable declaration, having scope of a single TSQL statement. The WITH statement is the declaration that defines the columns (schema) of the CTE. The AS portion defines the implementation of the CTE. Once declared, the CTE exists as any normal table would in the TSQL statement. As demonstrated above, the CTE can be reused as many times are required in the TSQL statement. In the example above we're using it twice (with different aliases) yet we don't have to re-declare it (as we were required to with derived tables). CTE re-use extends to UNIONs as well. For example:

with ProductAggregate(ProductId, CountOfTransactions, SumQuantity)
as (
    select ProductID, count(*), sum(Quantity)
    from Production.TransactionHistory
    group by ProductID)

select P.[Name], P.Color, PA.SumQuantity
from Production.Product P
    inner join ProductAggregate PA
        on P.ProductId = PA.ProductID
where PA.CountOfTransactions >= 10

union all

select P.[Name], P.Color, PA.SumQuantity
from Production.Product P
    inner join ProductAggregate PA
        on P.ProductId = PA.ProductID
where P.Color = 'red';

A couple of other points to note about CTEs:

  • A CTE can only be used in the TSQL statement it is declared in. If you find yourself repeating a CTE a number of times then you should consider 'upgrading' it to a database view in its own right (assuming you're not using a CTE-only feature - mentioned below).
  • CTE's can be used in SELECT / UPDATE / DELETE / INSERT statements, anywhere a table can be used.
  • The previous statement of any TSQL statement that uses a CTE must be terminated with a semi-colon. This can be a bit infuriating if you're not used to using semi-colons in your TSQL but try to get into the habit of terminating all your TSQL statements with semi-colons (this works even in SQL Server 2000).

Anyone that's looked at CTEs before will have noted that I have not yet mentioned the one feature that everyone immediately highlights when discussing CTE's, i.e. CTE's can be used recursively to join on themselves. In my opinion, the recursive properties of CTEs will only be used in the very few queries that need to unroll / traverse hierarchies. The ability to write modular set-based queries is the primary reason we should all start using CTEs. They make queries easier to maintain and allow complex queries to be broken down into their logical components without sacrificing the power of set-based queries (as opposed to poorly performing but easier to read procedural TSQL).

Labels: ,

Thursday, 6 July 2006

The OUTPUT Clause

posted by Stefan Delmarco

If any new feature of TSQL will be a big hit it will be the OUTPUT clause. This is one new addition to TSQL that you'll use everywhere! How many times have you written this code?

-- A typical table...
create table someTable (
    Id int identity(1,1) primary key clustered,
    SomeData varchar(100) not null);
go

declare @id int;
-- Insert the data.
insert someTable (someData)
values ('DataOne');

-- Figure out what identity the row was given.
set @id = scope_identity();
-- Return it...
select @id as Id;
go

Isn't it tedious that you have to read the data that you just inserted as a second operation following the insert? The whole reason scope_identity exists is to protect you from any context that might be introduced between these two operations that you may not be aware of, like triggers (aarrgh!). After all, why can't you just write the following...

declare @id int;

-- Insert the data, better...
insert someTable (someData)
output inserted.Id
values ('DataTwo');
go

This is exactly how the OUTPUT clause works. It allows data that is inserted / updated / deleted to be returned. Anyone that's done any work with triggers would have immediately recognised the 'virtual' inserted table. In fact, drawing parallels between triggers and the OUTPUT clause is the best way to understand what's going on here. The OUTPUT clause allows you to access the virtual trigger tables inline with your INSERT / UPDATE / DELETE statements. You essentially get 'on-the-fly' trigger access. For INSERT you have the inserted table, for update you have both the inserted and deleted tables, and for DELETE just the deleted table. Simple, isn't it? In the above example we use the OUTPUT clause to get to the identity column we inserted. It could have been any column in the table. Think how useful this would be for retrieving timestamp values or computed columns.

Have a look at these examples to see what some of the practical applications are:

(1) What did I just delete?

delete someTable
output deleted.*
where id = 1;
go

This query returns all of the data that was just deleted. Great for audit records! It also prevents having to take out locks on the rows you're going to delete so that the values read are the ones deleted. No encompassing transaction required here.

(2) Before and after values

update someTable
set someData = 'DataThree'
output deleted.someData as Before,
    inserted.someData as After,
    difference(deleted.someData, inserted.someData) as Delta
where id = 2;
go
Before After Delta
--------- ---------- ------
DataTwo DataThree 3

This query will return the values for someData as it was before the update and it is after the update in a single row. In addition, note that the OUTPUT clause is defined as a meaning that any expressions (i.e. functions / calculations, etc.) you could do in a SELECT you can do in an OUTPUT.

(3) Deleting multiple rows

-- Declare a table variable.
declare @someTableTemp table (
    Id int not null,
    someData varchar(100) not null);

-- Capture the output into a table variable.
delete someTable
output deleted.id, deleted.someData
into @someTableTemp
    (Id, someData);

-- What was deleted...
select *
from @someTableTemp;
go

The output can also be captured into a table variable. It doesn't have to be a table variable. It could be any permanent table. In this case think of the OUTPUT ... INTO as an INSERT ... SELECT. Just for the hell of it you can write crazy TSQL like this:

-- The delete that never was!
delete someTable
output deleted.someData
into someTable (someData);
go

Lastly, a couple of points to remember:

  • The values in the output virtual tables will be populated after any triggers for the table have fired, so remember that you are at the back of the queue.
  • Inserting / updating / deleting data and then re-reading the rows affected is always dangerous because it makes you susceptible to deadlocks. Using the OUTPUT clause will avoid this as the action and read occur as a single unit. In SQL Server 2000 there was a trick with the UPDATE statement that allowed you to avoid re-reading updates if you're interested in the value of a specific column that was updated (this only works if the UPDATE affects a single row). Consider the following table:

create table ReserveRange (
    IdName varchar(100) not null primary key clustered,
    Id int not null);
go

insert ReserveRange (IdName, Id)
values ('Id#1', 0);
go

You want to use this table to reserve ranges of Id's. You need to:

  • read the current Id,
  • increment it by a certain amount,
  • return the range that has been reserved in a thread-safe manner.

Normally developers wrap the TSQL in a transaction and use a UPDLOCK locking hint to serialize access to the table like this (a couple of variations of this are possible but all require the transaction and a long-running update lock):

begin tran;

declare @range int;
set @range = 100;
declare @id int;

-- Get the current value.
select @id = Id;
from ReserveRange with(updlock)
where IdName = 'Id#1';

-- Reserve a range.
update ReserveRange
set id = @id + @range
where IdName = 'Id#1';

-- Return the range
select @id + 1 as LowRange,
    @id + @range as HighRange;

commit tran;
go

The trick is to notice that the UPDATE statement supported triple assignments. Have a look at the following:

declare @range int;
set @range = 100;

declare @id int;

-- Update the current range and assign it to
-- a local variable
update ReserveRange
set @id = id = id + @range
where IdName = 'Id#1';

-- Return the range
select @id - @range + 1 as LowRange,
    @id as HighRange;

Note, no encompassing transaction and no long-running locks. Next we'll look at VARCHAR(MAX).

Labels: ,

Thursday, 6 April 2006

The OVER Clause Part II

posted by Stefan Delmarco

The only mention that the OVER clause can be used with aggregate functions (SUM, COUNT, etc.) is in the following statement in BOL (Books Online):

The OVER Clause

It gives no hint or inkling of the wealth of possibilities this opens up for queries that need to contain a mix aggregate and non-aggregate results. Suppose you were asked to write a query that will return a result set of all customers' FirstName, LastName and EmailAddress. For each customer, return a count of how many other customers have the same FirstName and a count of how many have the same LastName. Restrict this to customers whose LastName begins with a 'D' to keep it short. In SQL Server 2000 you'd have to start thinking of sub-selects and correlated sub-queries to mix these aggregate (one grouping on FirstName the other on LastName) and non-aggregate (each customer row needs to be returned independent of any aggregate grouping) requirements. If the requirement was for a pure aggregation (e.g. count of customers grouping by LastName) it would be very straight forward. However, with the mix you'd probably come up with something that looks like one of these:

-- #1, Expensive...
select FirstName, LastName, EmailAddress,
    (select count(*)
    from Person.Contact Cinner
    where Cinner.FirstName = Couter.FirstName
     and Cinner.LastName like 'D%') as [CountOfFirstName],
    (select count(*)
    from Person.Contact Cinner
    where Cinner.LastName = Couter.LastName
     and Cinner.LastName like 'D%') as [CountOfLastName]
from Person.Contact Couter
where LastName like 'D%'
order by LastName, FirstName

-- #2 Better...
select FirstName, LastName, EmailAddress, CountOfFirstName, CountOfLastName
from Person.Contact C
    inner join
        (select FirstName as [FirstNameCounted],
            count(*) as [CountOfFirstName]
        from Person.Contact
        where LastName like 'D%'
        group by FirstName) as [CountFirstNames]
    on C.FirstName = CountFirstNames.FirstNameCounted
    inner join
        (select LastName as [LastNameCounted],
            count(*) as [CountOfLastName]
        from Person.Contact
        where LastName like 'D%'
        group by LastName) as [CountLastNames]
    on C.LastName = CountLastNames.LastNameCounted
where LastName like 'D%'
order by LastName, FirstName

#1 is twice as expensive as #2. Execute these queries and have a look at the execution plans to understand why (we'll covering analysis of execution plans in the future). Both these queries produce the following result set (abbreviated):

FirstName LastName EmailAddress CountOfFirstName CountOfLastName
--------- --------- ------------------------------ ----------------- ---------------
...
David Daniels david34@adventure-works.com 87 2
David Daniels david31@adventure-works.com 87 2
Ryan Danner ryan3@adventure-works.com 60 2
Ryan Danner ryan5@adventure-works.com 60 2
Mike Danseglio mike2@adventure-works.com 9 1
Alexander Davis alexander7@adventure-works.com 23 77
Alexandra Davis alexandra68@adventure-works.com 93 77
Alyssa Davis alyssa4@adventure-works.com 67 77
...

As a quick sanity check we can see that anyone with the surname Davis has CountOfLastName as 77 and anyone with the FirstName David has CountOfFirstName as 87.

In the same manner the OVER clause applies to the window ranking function that immediately precedes it, the OVER clause affects the aggregate function that precedes it. In addition, if there are multiple aggregate functions, each one can have its own OVER clause that can be different from the others. However, as per the documentation snippet from above, aggregate functions only include the PARTITION BY clause in their OVER clause. Unlike ranking functions, there is no concept of an ORDER BY for an aggregate function (this wouldn't have any meaning to an aggregation anyway). The same query can now be rewritten in SQL Server 2005 as follows:

select FirstName, LastName, EmailAddress,
    count(*) over(partition by FirstName) as CountOfFirstName,
    count(*) over(partition by LastName) as CountOfLastName
from Person.Contact
where LastName like 'D%'
order by LastName, FirstName

How cool is that?! This rocks! Not only does it read much better than the queries above, it is the cheapest to execute by a long long shot (65% cheaper than #1 and 45% cheaper than #2 - have a look at the execution plans). This query produces the identical results to the queries above. Note that there is not a GROUP BY to be seen. SQL Server has essentially substituted the GROUP BY that all aggregate functions need with the PARTITION BY clause. As SQL Server is producing the result set, it is streaming the rows through the aggregate functions. The PARTITION BY tells SQL Server how to group the rows when applying the aggregation function in the same manner the PARTITION BY was telling SQL Server when to restart the ranking function.

Note that if you need to write a query that produces a pure aggregation then the normal GROUP BY form needs to be used as you'll want the source result set collapsed to their distinct values that the aggregate functions iterated over. If you need to mix aggregate and non-aggregate functions then OVER with PARTITION BY is the best option. Next up is the new TRY...CATCH syntax.

Labels: ,

Monday, 20 March 2006

The OVER Clause Part I

posted by Stefan Delmarco

The OVER clause is new to SQL Server 2005. It is used for aggregate (SUM, AVG, COUNT, etc.) and ranking (NTILE, ROW_NUMBER, RANK, etc.) functions. Previously we've seen it used to specify the sorting SQL Server needs to apply to a result set before evaluating its associated window function for each row. For example, the following query results a list of all vendors, their post codes and cities. Each row is assigned a ROW_NUMBER according to its absolute position when the rows are ordered according to their PostalCodes.

select row_number() over(order by PostalCode) as [RowNumber],
    [Name], PostalCode, City
from Person.Address A
    inner join Purchasing.VendorAddress VA
        on A.AddressId = VA.AddressId
    inner join Purchasing.Vendor V
        on VA.VendorId = V.VendorId
order by PostalCode;

RowNumber Name PostalCode City
--------- ---------------------------------- ---------- -------------
1 Premier Sport, Inc. 02113 Boston
2 A. Datum Corporation 10007 New York
3 Morgan Bike Accessories 12210 Albany
4 Wood Fitness 19107 Philadelphia
5 Competition Bike Training Systems 30308 Atlanta
6 Bike Satellite Inc. 37501 Downey
7 Midwest Sport, Inc. 48226 Detroit
...

Now, suppose I didn't want the row number to be consecutive over the entire result set. For example, I may want to have the rows numbered from 1 to n for each city. That is, I want the row numbers to be consecutive (1,2,3...n) per city. This is what the PARTITION BY clause allows us to do. By including a PARTITION BY clause in the window function's OVER clause we can essentially restart the function when the value of the column specified in the PARTITION BY changes. For example, to number the rows consecutively per city the query changes to:

select row_number() over(partition by City order by PostalCode) as [RowNumber],
    [Name], PostalCode, City
from Person.Address A
    inner join Purchasing.VendorAddress VA
        on A.AddressId = VA.AddressId
    inner join Purchasing.Vendor V
        on VA.VendorId = V.VendorId
order by City;
RowNumber Name PostalCode City
--------- --------------------------------- ----------- -----------
1 Morgan Bike Accessories 12210 Albany
2 Leaf River Terrain 97321 Albany
1 Allenson Cycles 91001 Altadena
2 Gardner Touring Cycles 91001 Altadena
1 Speed Corporation 98221 Anacortes
2 Electronic Bike Co. 98221 Anacortes
3 Northern Bike Travel 98221 Anacortes
1 Competition Bike Training Systems 30308 Atlanta
...

As you can see, the ROW_NUMBER function has restarted its numbering as soon as the City changed. The PARTITION BY clause can contain any number of columns available in the tables in the FROM clause. As PARTITION BY defines its arguments as an expression, we can start getting creative in how we perform the partitioning. For example, the following query partitions the row number according to the second letter of the address:

select row_number() over(partition by substring(City, 2, 1) order by PostalCode) as [RowNumber],
    [Name], PostalCode, City
from Person.Address A
    inner join Purchasing.VendorAddress VA
        on A.AddressId = VA.AddressId
    inner join Purchasing.Vendor V
        on VA.VendorId = V.VendorId
order by substring(City, 2, 1);

RowNumber Name PostalCode City
--------- --------------------------------- ----------- -----------
...
15 Bicycle Specialists 97034 Lake Oswego
16 Vista Road Bikes 97301 Salem
17 Knopfler Cycles 97301 Salem
18 Recreation Place 97301 Salem
19 Reliance Fitness, Inc. 98107 Ballard
20 Electronic Bike Repair & Supplies 98403 Tacoma
21 Cycling Master 99362 Walla Walla
1 A. Datum Corporation 10007 New York
2 Midwest Sport, Inc. 48226 Detroit
3 Green Lake Bike Company 80203 Denver
4 Beaumont Bikes 83301 West Covina
5 Greenwood Athletic Company 85252 Lemon Grove
...

ROW_NUMBER restarted its numbering as soon as the second letter of the City name changed from 'a' to 'e'. All of the above applies equally to RANK / DENSE_RANK and NTILE. The OVER clause can also be used with aggregate functions (like MIN, MAX, AVG, SUM, etc) which makes mixing aggregations and non-aggregations in a single query a lot easier than before. That's what we'll be looking at next.

Labels: ,

Thursday, 9 March 2006

The RANK and DENSE_RANK Ranking Functions

posted by Stefan Delmarco

RANK and DENSE_RANK are the last of the new ranking (window) functions. The best analogy of how these functions work is to think of how you'd solve the following problem: You've been asked to write a report that will generate a list of the best selling products. You're required to produce a result set that contains the product name, total quantity sold and assign each product a ranking (i.e. 1st, 2nd, 3rd, etc.). So you might write something like this:

select P.[Name],
    sum(TH.Quantity) as [TotalQuantity],
    row_number() over(order by sum(quantity) desc) as [Ranking]
from Production.TransactionHistory TH
    inner join Production.Product P
        on TH.ProductID = P.ProductID
group by P.[Name]
order by [TotalQuantity] desc;

Name TotalQuantity Ranking
----------------- ------------- --------------------
BB Ball Bearing 374090 1
Seat Stays 187376 2
Blade 93688 3
Chain Stays 93688 4
Fork End 93688 5
HL Crankarm 64900 6
HL Hub 56264 7
Decal 1