Fotia Ltd

Fotia Ltd

Thursday, 6 April 2006

The OVER Clause Part II

posted by Stefan Delmarco

The only mention that the OVER clause can be used with aggregate functions (SUM, COUNT, etc.) is in the following statement in BOL (Books Online):

The OVER Clause

It gives no hint or inkling of the wealth of possibilities this opens up for queries that need to contain a mix aggregate and non-aggregate results. Suppose you were asked to write a query that will return a result set of all customers' FirstName, LastName and EmailAddress. For each customer, return a count of how many other customers have the same FirstName and a count of how many have the same LastName. Restrict this to customers whose LastName begins with a 'D' to keep it short. In SQL Server 2000 you'd have to start thinking of sub-selects and correlated sub-queries to mix these aggregate (one grouping on FirstName the other on LastName) and non-aggregate (each customer row needs to be returned independent of any aggregate grouping) requirements. If the requirement was for a pure aggregation (e.g. count of customers grouping by LastName) it would be very straight forward. However, with the mix you'd probably come up with something that looks like one of these:

-- #1, Expensive...
select FirstName, LastName, EmailAddress,
    (select count(*)
    from Person.Contact Cinner
    where Cinner.FirstName = Couter.FirstName
     and Cinner.LastName like 'D%') as [CountOfFirstName],
    (select count(*)
    from Person.Contact Cinner
    where Cinner.LastName = Couter.LastName
     and Cinner.LastName like 'D%') as [CountOfLastName]
from Person.Contact Couter
where LastName like 'D%'
order by LastName, FirstName

-- #2 Better...
select FirstName, LastName, EmailAddress, CountOfFirstName, CountOfLastName
from Person.Contact C
    inner join
        (select FirstName as [FirstNameCounted],
            count(*) as [CountOfFirstName]
        from Person.Contact
        where LastName like 'D%'
        group by FirstName) as [CountFirstNames]
    on C.FirstName = CountFirstNames.FirstNameCounted
    inner join
        (select LastName as [LastNameCounted],
            count(*) as [CountOfLastName]
        from Person.Contact
        where LastName like 'D%'
        group by LastName) as [CountLastNames]
    on C.LastName = CountLastNames.LastNameCounted
where LastName like 'D%'
order by LastName, FirstName

#1 is twice as expensive as #2. Execute these queries and have a look at the execution plans to understand why (we'll covering analysis of execution plans in the future). Both these queries produce the following result set (abbreviated):

FirstName LastName EmailAddress CountOfFirstName CountOfLastName
--------- --------- ------------------------------ ----------------- ---------------
...
David Daniels david34@adventure-works.com 87 2
David Daniels david31@adventure-works.com 87 2
Ryan Danner ryan3@adventure-works.com 60 2
Ryan Danner ryan5@adventure-works.com 60 2
Mike Danseglio mike2@adventure-works.com 9 1
Alexander Davis alexander7@adventure-works.com 23 77
Alexandra Davis alexandra68@adventure-works.com 93 77
Alyssa Davis alyssa4@adventure-works.com 67 77
...

As a quick sanity check we can see that anyone with the surname Davis has CountOfLastName as 77 and anyone with the FirstName David has CountOfFirstName as 87.

In the same manner the OVER clause applies to the window ranking function that immediately precedes it, the OVER clause affects the aggregate function that precedes it. In addition, if there are multiple aggregate functions, each one can have its own OVER clause that can be different from the others. However, as per the documentation snippet from above, aggregate functions only include the PARTITION BY clause in their OVER clause. Unlike ranking functions, there is no concept of an ORDER BY for an aggregate function (this wouldn't have any meaning to an aggregation anyway). The same query can now be rewritten in SQL Server 2005 as follows:

select FirstName, LastName, EmailAddress,
    count(*) over(partition by FirstName) as CountOfFirstName,
    count(*) over(partition by LastName) as CountOfLastName
from Person.Contact
where LastName like 'D%'
order by LastName, FirstName

How cool is that?! This rocks! Not only does it read much better than the queries above, it is the cheapest to execute by a long long shot (65% cheaper than #1 and 45% cheaper than #2 - have a look at the execution plans). This query produces the identical results to the queries above. Note that there is not a GROUP BY to be seen. SQL Server has essentially substituted the GROUP BY that all aggregate functions need with the PARTITION BY clause. As SQL Server is producing the result set, it is streaming the rows through the aggregate functions. The PARTITION BY tells SQL Server how to group the rows when applying the aggregation function in the same manner the PARTITION BY was telling SQL Server when to restart the ranking function.

Note that if you need to write a query that produces a pure aggregation then the normal GROUP BY form needs to be used as you'll want the source result set collapsed to their distinct values that the aggregate functions iterated over. If you need to mix aggregate and non-aggregate functions then OVER with PARTITION BY is the best option. Next up is the new TRY...CATCH syntax.

Labels: ,

Previous Posts