Fotia Ltd

Fotia Ltd

Friday, 22 September 2006

The PIVOT Operator

posted by Stefan Delmarco

SQL Server 2005 introduces the PIVOT operator to the TSQL stable. The intention of this operator is to make it easier to transform unique values in rows into columns and aggregate the resulting data. Anyone familiar with Excel pivot tables or Access cross-tab queries will immediately be able to visualise what pivot tables look like. Prior to SQL Server 2005, the only way to transform data, in the manner provided by PIVOT, was to use a number of SELECT...CASE statements. As we'll see, the PIVOT operator has not given us functionality that was not possible before. However, it has made pivoting data much easier to perform and far more readable.

In order to understand how to form a PIVOT query, it is important to be able to describe, in plain English, what the query is trying to accomplish. Personally I have found that thinking of the PIVOT result set in terms of a graph (i.e. a X- and Y- axis) makes it easier to formulate into TSQL.

As a worked example, consider the following requirement. You need to generate a report that will correlate the number of addresses that were changed per city for each year. In other words, formulate a result set that will have each city listed as a column (X-axis). For each city, count the number of times an address in that city was modified in a given year (Y-axis). In SQL Server 2000, you'd probably create some TSQL similar to the following:

select datepart(yyyy, modifiedDate) as [WhenUpdates],
    sum(case city when 'Monroe' then 1 else 0 end) as [Monroe],
    sum(case city when 'Redmond' then 1 else 0 end) as [Redmond],
    sum(case city when 'Snohomish' then 1 else 0 end) as [Snohomish],
    sum(case city when 'Issaquah' then 1 else 0 end) as [Issaquah]
from Person.Address
group by datepart(yyyy, modifiedDate)
order by 1;

WhenUpdates Monroe Redmond Snohomish Issaquah
----------- ----------- ----------- ----------- -----------
1996 1 0 0 0
1997 0 1 0 0
1998 2 2 1 1
1999 11 12 7 13
2000 0 3 2 0
2001 0 5 0 6
2002 1 9 0 13
2003 0 45 0 42
2004 1 44 0 45

A couple of things to note:

  • Even though there are many more cities in the Person.Address table than the ones specified, we must 'hardcode' the city names into the query. There is no way to dynamically create additional columns if a new city is encountered. That fact alone makes this query difficult to maintain. If another city is added to Person.Address, this query would have to be modified.
  • The Y-axis (WhenUpdated) column is dynamic, we do not require prior knowledge of the modifiedDate's content when formulating the query. When an address is modified in a new year (e.g. 2005), another row will automatically be added to this result set without any changes.

Now, for the PIVOT operator. There is good news and bad news. The good news is that the syntax is far more compact and readable. The bad news is that it overcomes none of the limitations mentioned above! However, as we'll see, the good news is still worthwhile...

A PIVOT query is split into a number of sections. First, you need to write the SELECT statement that provides you with the raw, non-pivoted data. In our example, the raw data would be the following query:

select datepart(yyyy, modifiedDate) as [WhenUpdated], AddressID, City
from Person.Address;

WhenUpdated AddressID City
----------- ----------- ------------------------------
...
2004 19 Bothell
1999 20 Bothell
1999 21 Bothell
2001 22 Portland
1999 23 Seattle
2001 24 Duluth
2002 25 Dallas
2001 26 San Francisco
...

WhenUpdated is the Y-axis, AddressID is a data point - we're going to count AddressIDs for a given (X,Y) / (City,WhenUpdated) combination - and City is the X-axis. Once we have the raw data, it is simple case of articulating which column is used for what function in the pivot:

select Pvt.*
from (
    -- The raw data as a derived table...
    select datepart(yyyy, modifiedDate) as [WhenUpdated], AddressID, City
    from Person.Address) as RawData
pivot (
    count(AddressID)  -- How each X,Y point is calculated...
    for City          -- The X-axis, stated explicitly...
    in ([Monroe], [Redmond], [Snohomish], [Issaquah])) as Pvt
order by 1;

WhenUpdated Monroe Redmond Snohomish Issaquah
----------- ----------- ----------- ----------- -----------
1996 1 0 0 0
1997 0 1 0 0
1998 2 2 1 1
1999 11 12 7 13
2000 0 3 2 0
2001 0 5 0 6
2002 1 9 0 13
2003 0 45 0 42
2004 1 44 0 45

At first glance the PIVOT section seems a bit unintuitive. However, if you work through the syntax it will start to make sense. The PIVOT operator needs to know how to calculate each (X,Y) data point's value, specified as an aggregate function. The FOR column contains the X-axis values and the IN contains the hard-coded list of X-axis columns. As you can see, there is no getting around the SELECT...CASE limitation of having to know the transformed columns before-hand. In case you are wondering: no, IN cannot take a wildcard, e.g. IN(*). I believe that the limitation exists due to the optimizer's requirement to have fixed metadata when the execution plan is generated. This would not be possible if the column metadata is dynamically generated from the data itself.

The Y-axis is inferred as the distinct set of all columns that have not been used by the aggregate function or by the X-Axis. In our case, that would be the WhenUpdated computed column.

In terms of execution plans, the SELECT...CASE and PIVOT produce almost identical plans. However, as we're throwing away more of the rows in the Person.Address table (only 267 of the 19,614 rows are for the cities in our X-axis), both queries would be far more efficient if they included the set of X-axis column values (i.e. the names of the cities) in their WHERE clauses.

Instead of trying to dream up dynamic-SQL ways of working around this problem (non-dynamic columns and rows falling outside the predefined X-axis values) I would argue that you're better off aggregating the data in TSQL (using 'traditional' SELECT...GROUP BY statements) and transforming it in your middle tier / client application where there are better data structures that can be used to represent extremely wide or sparse result sets.

A better pattern to follow is to realise that the X-axis can be used to specify fixed ranges instead of raw values. Using this technique you can select ranges that will guarantee that you have covered off all possible X-axis values. For example, instead of allowing the raw unique values of a column to destabilise the PIVOT, the following query maps the values in the TotalDue column in a number of well-known buckets and then uses those values as the X-axis. Using this technique there is no possibility of having rows fall outside the X-axis range:

select Pvt.*
from (
    select datepart(yyyy, OrderDate) as [OrderDate],
        case
            when TotalDue < 100 then 'NoMargin'
            when TotalDue >= 100 and TotalDue > 1000 then 'OnTarget'
            else 'ExtraBonus' end as [Profit],
        PurchaseOrderId
    from Purchasing.PurchaseOrderHeader) as RawData
pivot (
    count(PurchaseOrderId)
    for Profit
    in([NoMargin], [OnTarget], [ExtraBonus])) as Pvt
order by 1;

OrderDate NoMargin OnTarget ExtraBonus
----------- ----------- ----------- -----------
2001 0 4 4
2002 7 105 160
2003 26 399 607
2004 68 1070 1550

As we've seen, the PIVOT operator does not provide any transformational capabilities that weren't possible in SQL Server 2000 (with liberal use of SELECT...CASE). However, as any experienced developer will know, the easier it is to express a reasonably complex SET-based problem, the better chance you have of getting it right, first time. Even more importantly, the poor developer required to maintain it (after you've moved on to greater conquests) will have a better chance of getting it right for the second time!

Labels: ,

Previous Posts