Fotia Ltd

Fotia Ltd

Monday, 7 August 2006

Introducing Common Table Expressions (CTEs)

posted by Stefan Delmarco

The concept of CTEs has been around in SQL Server for a long time. Anyone who's used derived tables before will immediately recognise and understand CTEs. Think of CTEs as formalised sub-queries or temporary views that can be defined and referenced in a single TSQL statement.

I've labelled this article as an introduction as CTEs bring a lot to the TSQL table, which we'll explore in much greater detail in future articles.

Consider the following requirement: you need a result-set that returns the sum of all products sold per product. For each product, you'll need to return a bunch of metadata about that product. In addition, your boss (for reasons only known to him) want these results only for products that are either red in colour or have more than 10 transactions! In SQL Server 2000 you might start writing something similar to the following:

select [Name], Color, sum(Quantity) as [SumQuantity]
from Production.Product P
    inner join Production.TransactionHistory T
        on P.ProductId = T.ProductId
where P.Color = 'red'
    or (    select count(*)
            from Production.TransactionHistory THinner
            where THinner.ProductId = P.ProductID) >= 10
group by [Name], Color;

Whilst this query gets the job done, it does look a bit messy as we're having to perform a correlated sub-query for each ProductID, evaluate it as a scalar and then determine whether or not we're going to calculate the sum of all quantities of all transactions for the associated ProductID. In addition, this TSQL is misrepresenting its purpose in the GROUP BY clause. The requirement stated that the summation needs to be performed per ProductID. We probably got lucky here as Name is unique per Product. If it wasn't we could get some unpredictable results (e.g. if there were two Pink FooBar's that differed only by some other metadata). Any column in the SELECT list that is not an aggregate must be included in the GROUP BY. It is the mix of aggregate and non-aggregate data that has lead to this mistake. A correct way to write this query would be to use a derived table:

select P.[Name], P.Color, PA.SumQuantity
from Production.Product P
    inner join (
            select ProductID,
                count(*) as [CountOfTransactions],
                sum(Quantity) as [SumQuantity]
            from Production.TransactionHistory
            group by ProductID) as PA
        on P.ProductId = PA.ProductID
where PA.[CountOfTransactions] >= 10
 or P.Color = 'red';

This is a much better as we have separated out the aggregate from the additional metadata portion of the query. The inner select statement is a derived table that is uncorrelated with the outer select so that it can execute independently. We have given this derived table the PA alias. Whilst this has given the TSQL much better structure, it does have some limitations. If we wanted to refer to the derived table again we would have to copy-paste it again into the select. For example, if we now wanted to include number of transactions the modulo 10 of the ProductID contained (simulating an arbitrary Product-to-Product relationship), we'd have to copy-and-paste the whole derived table again:

select P.[Name], P.Color, PA1.SumQuantity, PA2.ProductID, PA2.SumQuantity
from Production.Product P
    inner join (
            select ProductID,
                count(*) as [CountOfTransactions],
                sum(Quantity) as [SumQuantity]
            from Production.TransactionHistory
            group by ProductID) as PA1
        on P.ProductId = PA1.ProductID
    inner join (
            select ProductID,
                count(*) as [CountOfTransactions],
                sum(Quantity) as [SumQuantity]
            from Production.TransactionHistory
            group by ProductID) as PA2
        on P.ProductId % 10 = PA2.ProductId
where PA1.[CountOfTransactions] >= 10
 or P.Color = 'red';

Now to introduce CTEs: Rewriting the above query to use a CTE will look as follows,

with ProductAggregate(ProductId, CountOfTransactions, SumQuantity)
as (
    select ProductID, count(*), sum(Quantity)
    from Production.TransactionHistory
    group by ProductID)
select P.[Name], P.Color, PA1.SumQuantity, PA2.ProductID, PA2.SumQuantity
from Production.Product P
    inner join ProductAggregate PA1
        on P.ProductId = PA1.ProductID
    inner join ProductAggregate PA2
        on P.ProductId % 10 = PA2.ProductId
where PA1.CountOfTransactions >= 10
or P.Color = 'red';

As a developer, think of a CTE as a local variable declaration, having scope of a single TSQL statement. The WITH statement is the declaration that defines the columns (schema) of the CTE. The AS portion defines the implementation of the CTE. Once declared, the CTE exists as any normal table would in the TSQL statement. As demonstrated above, the CTE can be reused as many times are required in the TSQL statement. In the example above we're using it twice (with different aliases) yet we don't have to re-declare it (as we were required to with derived tables). CTE re-use extends to UNIONs as well. For example:

with ProductAggregate(ProductId, CountOfTransactions, SumQuantity)
as (
    select ProductID, count(*), sum(Quantity)
    from Production.TransactionHistory
    group by ProductID)

select P.[Name], P.Color, PA.SumQuantity
from Production.Product P
    inner join ProductAggregate PA
        on P.ProductId = PA.ProductID
where PA.CountOfTransactions >= 10

union all

select P.[Name], P.Color, PA.SumQuantity
from Production.Product P
    inner join ProductAggregate PA
        on P.ProductId = PA.ProductID
where P.Color = 'red';

A couple of other points to note about CTEs:

  • A CTE can only be used in the TSQL statement it is declared in. If you find yourself repeating a CTE a number of times then you should consider 'upgrading' it to a database view in its own right (assuming you're not using a CTE-only feature - mentioned below).
  • CTE's can be used in SELECT / UPDATE / DELETE / INSERT statements, anywhere a table can be used.
  • The previous statement of any TSQL statement that uses a CTE must be terminated with a semi-colon. This can be a bit infuriating if you're not used to using semi-colons in your TSQL but try to get into the habit of terminating all your TSQL statements with semi-colons (this works even in SQL Server 2000).

Anyone that's looked at CTEs before will have noted that I have not yet mentioned the one feature that everyone immediately highlights when discussing CTE's, i.e. CTE's can be used recursively to join on themselves. In my opinion, the recursive properties of CTEs will only be used in the very few queries that need to unroll / traverse hierarchies. The ability to write modular set-based queries is the primary reason we should all start using CTEs. They make queries easier to maintain and allow complex queries to be broken down into their logical components without sacrificing the power of set-based queries (as opposed to poorly performing but easier to read procedural TSQL).

Labels: ,

Previous Posts