Fotia Ltd

Fotia Ltd

Friday, 1 February 2008

Partitioned Views on a Budget

posted by Stefan Delmarco

Partitioned views are a great feature of SQL Server. By injecting an additional layer of abstraction between your logical view and physical implementation they give you fine grained control over data partitioning. However, the cover charge required for entry to this exclusive club is pretty high. The checklist required to implement a partitioned view can often interfere with your design to such a degree that they become more of a hindrance than a benefit. However, by decomposing the features SQL Server uses to implement the complete partitioned views specification we can get most of the benefits without meeting all of the criteria.

One of the benefits of partitioned views is the query optimizer's ability to determine which subset of the member tables need to be queried for each access to the partitioned view. The SQL Server query optimizer has clever features built into it that are able to understand variables and constants in TSQL and relate those constants to search predicates. The optimizer is able to use that insight to short-circuit query plan execution and not bother with branches that will never execute. For example:

declare @constant int;
set @constant = 0;

select *
from Person.Contact
where 1 = @constant;

The query plan for this bit of TSQL looks like this:

The SQL Server query optimizer has introduced a filter into the query plan that has a special property called a Startup Expression Predicate. This property is the short-circuit logic that determines whether or not the clustered index scan on the Person.Contact table executes or not. If the startup expression evaluates to FALSE then the optimizer doesn't bother executing any portion of that query. In this example, SQL Server will do nothing when @constant = 1 evaluates to FALSE. It is the exact same technology that partitioned views use to determine which underlying member tables to access based on the member tables' check constraints / partitioning columns.

For example, consider a situation where I would like to query a table that exists in a number of databases. I would like to have a partitioned view that aggregates these member tables in a single view. However, I don't want to pay the penalty of having to query every member table for every query. If I specify the database name in the SELECT query, I want the optimizer to use its short-circuit feature to go directly to the one table that will satisfy the search arguments. The solution is easier than you think!

create database Database01;
create database Database02;
create database Database10;
create database Database11;
go

use Database01
create table MemberTable (
   Col1 varchar(10));
go

use Database02
create table MemberTable (
   Col1 varchar(10));
go

use Database10
create table MemberTable (
   Col1 varchar(10));
go


use Database11
create table MemberTable (
   Col1 varchar(10));
go

insert Database01.dbo.MemberTable values ('Data01');
insert Database02.dbo.MemberTable values ('Data02');
insert Database10.dbo.MemberTable values ('Data10');
insert Database11.dbo.MemberTable values ('Data11');
go

use tempdb;
go
create view BudgetView as
select 'Database01' as SourceDatabase, Col1
from Database01.dbo.MemberTable
union all
select 'Database02', Col1
from Database02.dbo.MemberTable
union all
select 'Database10', Col1
from Database10.dbo.MemberTable
union all
select
'Database11', Col1
from Database11.dbo.MemberTable
go

select *
from BudgetView;

SourceDatabase Col1
-------------- ----------
Database01 Data01
Database02 Data02
Database10 Data10
Database11 Data11

(4 rows(s) affected)

The query plan for a SELECT of all of the rows shows SQL accessing each member table in turn:

This is the clever bit. We've put the partitioning criteria that separates each member table into the view's definition. We've hardcoded the database name of each member table into the first column of the view. The really clever bit is that SQL Server understands this! Try the following:

select *
from BudgetView
where SourceDatabase = 'Database01';

SourceDatabase Col1
-------------- ----------
Database01 Data01

(1 row(s) affected)

SQL Server has looked at the search argument we specified for SourceDatabase and short-circuited the query plan. It only queried Database01's MemberTable and didn't bother accessing any of the others!

It gets even better, SQL Server understands all manner of comparisons, including LIKEs, in this example it only access databases where the name starts with Database0.

select *
from BudgetView
where SourceDatabase like 'Database0%';

SourceDatabase Col1
-------------- ----------
Database01 Data01
Database02 Data02

(2 row(s) affected)

The story with parameters is also good! The startup filters we saw earlier make an appearance:

declare @sourceDatabase varchar(20);
set @sourceDatabase = 'Database10';

select *
from dbo.BudgetView
where SourceDatabase = @SourceDatabase;
SourceDatabase Col1
-------------- ----------
Database10 Data10

(1 row(s) affected)

At execution time SQL Server will dynamically determine which branches of the query plan to short-circuit. This is great as it allows a single query plan to be re-used for any value of @sourceDatabase and still deliver the short-circuits benefits.

We've managed to create a view that has the same short-circuit benefits as a pukka partitioned view without having to adhere to all the partitioned view requirements. Our base tables don't have primary keys defined and there are no check constraints. On the downside our view is not updateable, it cannot be inserted into, updated or deleted. If you need all of these features you can either implement INSTEAD OF triggers or go with fully-fledged partitioned views. In this example we used the partitioning column to identify the source database name. Constant partitioning columns are also useful for soft-delete / archive tables, coarse categories, or chunky time periods.

Labels: ,

Monday, 23 April 2007

The NEWSEQUENTIALID Function

posted by Stefan Delmarco

The NEWSEQUENTIALID system function is an addition to SQL Server 2005. It seeks to bring together, what used to be, conflicting requirements in SQL Server 2000; namely identity-level insert performance and globally unique values.

Insert Performance and Uniqueness

The best insert performance will always be achieved by using an identity column as the clustering key. With a monotonically increasing identity column as the clustered index, data is always inserted at the end of the table. At a basic level this is equivalent to the performance of a table that has no clustered index (i.e. a heap) where the insert position for the next row is the page where the last row was inserted. Secondly, there will be no page splits. With a normal clustered index, SQL Server will move half the rows on a full page to a newly allocated page in order to make space for the new row. With an identity column on a clustered index SQL Server will not 'split' the page in this manner. It will just allocate another page and continue inserting rows. It has enough smarts to know not to copy half the rows to the newly allocated page as all new rows will be inserted into the new page.

The limitations of identity columns come to the fore when you need have unique keys that span either tables, databases or SQL Servers. For example, say you have multiple branches that can each generate orders. Each branch has a replicated copy of the Orders table, however you don't want duplicate order numbers generated across branches. Common solutions to this problem include creating a composite order number that includes the branch id, reserving ranges of the identity column or making use of a central 'number fountain'. Each of these approaches introduce a level of indirection that must be adhered to before a row can be created. If the order number is not required to be human-readable friendly, GUIDs are often used instead.

GUIDs

GUIDs (Globally Unique Identifiers) are pseudo-random 16-byte (128-bit) numbers whose generation algorithm provides a sufficiently high degree of probability that the same GUID will never be generated twice on any computer at any time. Although a GUID is a binary array, it is often represented in its hexadecimal form using the following format, for example: dbbc2827-edf8-4a2d-92ad-c1e0059304d7. This makes them suitable for the 'distributed uniqueness requirement' of our Orders table.

The problem would be pretty much solved there unless we have some non-trivial performance requirements, especially around insert volumes and speed. As GUIDs are effectively pseudo-random (the degree of randomness / predictability was changed by Microsoft in response to privacy concerns) they do not possess the same orderly insert benefits as identity columns when used as the clustering key. The insert IO pattern changes from sequential to random-access when using GUIDs instead of an IDENTITY column. Not only do GUIDs require far more IO to locate the insert location in the clustered index (the index has to be traversed from the root for every insert), fragmentation at the leaf level leads to less dense pages, which in turn requires defragmentation / reindexing.

NEWSEQUENTIALID()

Enter NEWSEQUENTIALID(), stage left. This is a new system function included in SQL Server 2005 that combines the best of both worlds. NEWSEQUENTIALID() will generate a GUID that will be greater in value than the previously generated one.

It is easy to see how this function works by looking at the output the following TSQL script generates:

create table TestTable (
   id uniqueidentifier default newsequentialid() not null primary key clustered,
   sequence int);
go

-- Insert 100 rows.
declare @count int;
set @count = 0;
while @count < 100 begin
   insert TestTable (sequence)
   values (@count);

   set @count = @count + 1;
end;
go

select *
from TestTable
order by id;
go

id sequence
------------------------------------ -----------
FA780E3B-03C2-DB11-BD9F-0011D82F3F23 0
FB780E3B-03C2-DB11-BD9F-0011D82F3F23 1
FC780E3B-03C2-DB11-BD9F-0011D82F3F23 2
FD780E3B-03C2-DB11-BD9F-0011D82F3F23 3
FE780E3B-03C2-DB11-BD9F-0011D82F3F23 4
...  
5C790E3B-03C2-DB11-BD9F-0011D82F3F23 98
5D790E3B-03C2-DB11-BD9F-0011D82F3F23 99

You'll notice that there is a clear bit-shifting pattern that SQL Server is employing to keep the GUIDs increasing in relative value. These are the restrictions in NEWSEQUENTIALID's use:

  • NEWSEQUENTIALID cannot be used in arbitrary TSQL statements. It can only be specified as the DEFAULT value for an uniqueidentifier column
  • More than one column in a table can use NEWSEQUENTIALID
  • It cannot be combined with other scalar functions e.g. REVERSE(NEWSEQUENTIALID()) is not allowed

The network card's MAC address is easily identifiable in the generated GUID. My laptop's MAC address is

   00-01-4A-28-64-8B

This is one of the NEWSEQUENTIALID GUIDs that was generated:

   4EAC7708-30C3-DB11-B902-00014A28648B

Hence the privacy warnings in BOL:

NEWSEQUENTIALID() Privacy Warning

Correlating the position of the MAC address in the GUID with the bits that get incremented / shifted for each successive GUID sheds light on why the GUIDS remain globally unique. The MAC address portion of the GUID does not change.

Performance Comparison

The following test demonstrates the benefits of NEWSEQUENTIALID. The insert performance of the various clustering key variations was measured and compared. For each test the same TestTable was recreated with a different Id column definition:

  • IDENTITY(,)
  • NEWID()
  • NEWSEQUENTIALID()

The following TSQL contains the DDL used to construct the tables for the 3 clustered keys:

-- IDENTITY
create table TestTable (
   id int identity(1,1) not null primary key clustered,
   sequence int not null,
   data char(250) not null default '');
go

-- NEWID
create table TestTable (
   id uniqueidentifier default newid() not null primary key clustered,
   sequence int not null,
   data char(250) not null default '');
go

-- NEWSEQUENTIALID
create table TestTable (
   id uniqueidentifier default newsequentialid() not null primary key clustered,
   sequence int not null,
   data char(250) not null default '');
go

For each test 50,000 rows were inserted into TestTable table using the following TSQL script:

-- Insert 50,000 rows.
declare @count int;
set @count = 0;
while @count < 50000 begin
   insert TestTable (sequence)
   values (@count);

   set @count = @count + 1;
end;
go

The following metrics were gathered after each run:

-- Get the number of read / writes for this session...
select reads, writes
from sys.dm_exec_sessions
where session_id = @@spid;

-- Get the page fragmentation and density at the leaf level.
select index_type_desc, index_depth, page_count, avg_page_space_used_in_percent, avg_fragmentation_in_percent, record_count
from sys.dm_db_index_physical_stats(db_id(), object_id('TestTable'), null, null, 'detailed')
where index_level = 0;
go

The results are quite compelling:

  Reads Writes Leaf Pages Avg Page Used Avg Fragmentation Record Count
IDENTITY(,) 0 1,683 1,667 98.9% 0.7% 50,000
NEWID() 0 5,386 2,486 69.3% 99.2% 50,000
NEWSEQUENTIALID() 0 1,746 1,725 99.9% 1.0% 50,000

Most striking is the number of writes required by the NEWID system function. This, coupled with the average page density of 69%, is evidence of the page splitting caused by the random distribution of inserts at the leaf level. As soon as a page fills up, it needs to be split into 2 pages of 50% each for the insert to complete. Not only has page splitting resulted in poor page density, it has fragmented the data pages quite badly (there is a 99% probability that the next data page is not next to the current one). In our tests the most likely place for a free page required for the page split is at the end of the table irrespective of where the row is being inserted. Therefore to read the rows in order the scan needs to keep jumping back and forth between widely distributed split pages, hence the appalling fragmentation.

The minor difference in page count between IDENTITY and NEWSEQUENTIALID metrics is due to the difference in size between IDENTITY's INT (4 bytes) and NEWSEQUENTIALID's UNQUEIDENTIFIER (16 bytes). However, note that any non-clustered index on the UNIQUEIDENTIFIER table will be 4 times larger than the same index on the IDENTITY table. Therefore IDENTITY is still preferred over NEWSEQUENTIALID when choosing between the two.

The NEWSEQUENTIALID system function has clearly lived up to its claim of providing GUID-like uniqueness coupled with identity-level insert performance. The number of writes, fragmentation and page density are all inline with identity-level metrics. These benefits make NEWSEQUENTIALID a compelling feature.

Labels: ,

Monday, 12 March 2007

When a String Is Not a String

posted by Stefan Delmarco

This article was inspired by a seemingly innocent bit of code that caused a performance problem in a very unexpected way. It is the kind of code that you'd find in many samples. Unfortunately, it has a potential flaw that is only exposed when large data volumes are present.

The Problem Code

The Data Access Layer (DAL) code in question is as follows:

public static class AccountDal
{
   public static string GetTerritoryForAccount(string accountNumber)
   {
      using(SqlConnection con = new SqlConnection(@"Server=localhost;Database=AdventureWorks;
         Integrated Security=SSPI"
))
      using(SqlCommand cmd = con.CreateCommand())
      {
         cmd.CommandText = @"
            select t.Name
            from Sales.Customer c
               inner join Sales.SalesTerritory t
                  on c.TerritoryID = t.TerritoryId
            where c.AccountNumber = @AccountNumber"
;
         cmd.CommandType = CommandType.Text;
         cmd.Parameters.AddWithValue("@AccountNumber", accountNumber);

         con.Open();
         return cmd.ExecuteScalar() as string;
      }
   }
}

Very simple stuff. Open a connection to SQL, select a specific row from the Sales.Customer table, join the resultant row to the Sales.SalesTerritory table and return the name of the territory. The code is making diligent use of parameters for reusable query plans, strong typing and to prevent SQL injection attacks. The following test exercises the code:

[TestClass]
public class AccountDalTests
{
   [TestMethod]
   public void CanGetTerritory()
   {
      string territory = AccountDal.GetTerritoryForAccount("AW00000010");

      Assert.AreEqual(territory, "Canada");
   }
}

This test passes, so functionally the code works as expected. Running this bit of code on a decently spec'd developer workstation takes about 800 ms. Running it repeatedly doesn't improve on this figure. The Sales.Customer table contains 19,185 rows. Given that we're looking up a single row in the Sales.Customer row the alarm bells should be ringing, especially as the Sales.Customer has a nonclustered, unique index on AccountNumber:

exec sp_helpindex 'Sales.Customer';

index_name index_description index_key
-------------------------- ----------------------------------------------------- --------------------
AK_Customer_AccountNumber nonclustered, unique located on PRIMARY AccountNumber
AK_Customer_rowguid nonclustered, unique located on PRIMARY rowguid
IX_Customer_TerritoryID nonclustered located on PRIMARY TerritoryID
PK_Customer_CustomerID clustered, unique, primary key located on PRIMARY CustomerID

Diagnosing the Issue

Capturing the TSQL being executed in SQL Profiler and running it through Query Analyzer shows the following query plan. The query plan shows exactly where all the time is being spent: Original Query Plan The IX_Customer_TerritoryID index is being used. However, we're performing an index scan of all the AccountNumbers instead of just an index seek to the specific row we're returning. The compute scalar step following the index scan shows precisely why. Expr1004 is the output of an implicit conversion of every AccountNumber to NVARCHAR(10). Looking at the definition of the Sales.Customer table shows that AccountNumber is, in fact, a VARCHAR(10):

select column_name, data_type, character_maximum_length
from information_schema.columns
where table_catalog = 'AdventureWorks'
 and table_schema = 'Sales'
 and table_name = 'Customer';

column_name data_type character_maximum_length
-------------- ------------------- ------------------------
CustomerID int NULL
TerritoryID int NULL
AccountNumber varchar 10
CustomerType nchar 1
rowguid uniqueidentifier NULL
ModifiedDate datetime NULL

Looking at the TSQL the application sends to SQL Server, we can see that it is binding @AccountNumber as an NVARCHAR(10):

exec sp_executesql N'
   select t.Name
   from Sales.Customer c
      inner join Sales.SalesTerritory t
         on c.TerritoryID = t.TerritoryId
   where c.AccountNumber = @AccountNumber'
,
   N'@AccountNumber nvarchar(10)',
   @AccountNumber=N'AW00000010'

So the application is binding an NVARCHAR(10) parameter that is then being compared to a VARCHAR(10) column. SQL Server then converts every row of the VARCHAR(10) column to an NVARCHAR(10) data type. Clearly this is where the performance problem is. The scalability of this query is therefore linearly dependant on the number of rows in the table. The more rows there are, the more values need to be converted. This not a good situation to be in.

Data Type Precedence

Whenever SQL Server is asked to compare two differing data types, it uses its data type precedence rules to determine which data type needs to be converted to perform the comparison. Assuming an implicit conversion exists, the data type that appears lower in the list is converted to the data type that appears higher in the list. SQL Server's Data Type Precedence order has NVARCHAR appearing higher than VARCHAR. Therefore, all VARCHAR data types need to be converted to NVARCHAR in order to evaluate the AccountNumber values.

Unfortunately, in our case, we have ~20,000 VARCHARs to convert to NVARCHAR and 1 NVARCHAR @AccountNumber! This is the reason for the index scan in the query plan. Note that with SQL Server 7.0, the query optimiser would always convert the literal and never the column. SQL Server 2000/5 strictly adheres to the Data Type Precedence rules, even if it means having to convert 19,185 rows. This change in behaviour is documented in KB 271566.

Looking back at the original C# code, we didn't specify a SqlDbType when adding the @AccountNumber parameter. The SqlParameter class will infer the SqlDbType from the object value specified if it is not specified explicitly. A System.String is always interpreted as SqlDbType.NVarChar as all .NET strings are Unicode.

There are a number of possible solutions to this problem:

  • Change the C# code to have knowledge of the SqlDbTypes the parameters are binding to
  • Standardise by only allowing inferred SqlDbType types to be used in your database

Personally, I prefer the first option by a country mile. The second option prevents you from designing a data model that fully exploits all of the advantages of the SQL Server domain. For example, disallowing VARCHARs will double the amount of storage required to store your VARCHARs. I prefer to use a data type dictionary.

Data Type Dictionary

Data dictionaries are normally only used within databases. However, introducing a data type dictionary in the DAL allows you to have fine-grained control over your data types as well as other data type facets such as length and precision. In addition, it provides for better type safety as users of the data dictionary can only pass a .NET string to get an AccountNumber SqlParameter object. Any other .NET would obviously be a compile time error. Lastly, should you ever have to change data types (e.g. accountNumber becomes a System.Int32) it would be easy to make the change to the GetAccountNumber method and get compile time errors where string accountNumbers are still being used.

public static class SqlDictionary
{
   public static SqlParameter GetAccountNumber(string accountNumber)
   {
      SqlParameter parameter = new SqlParameter("@AccountNumber", SqlDbType.VarChar, 10);
      parameter.Value = accountNumber;

      return parameter;
   }
}

With this change, the query plan now looks far more respectable: Updated Query Plan

Relying on inferred Object / Relational data mappings can lead to unexpected issues. This is a classic example of the ORM-impedance mismatch that so many tools try to abstract away. However, by utilising a simple data dictionary, many of these problems can be easily solved by writing disciplined code.

Default Implicit Conversion

SQL Server 2005 has mediated the data type precedence trap somewhat. For example, in SQL Server 2000, comparing a BIT column to a literal 1 always resulted in the BIT column being upcast to a TINYINT. Whilst this may sound a bit bizarre, the problem was that the literal number 1 was always implicitly assigned the TINYINT data type and TINYINT appears above BIT in the data type precedence rules. SQL Server 2005 now applies a bit more intelligence. It assigns the literal the same data type as the column so no conversion is required.

So, if you're still using SQL Server 2000, be sure to specify your BIT literals as 0x1. Whilst this doesn't equate to a BIT data type, its BINARY datatype is lower on the data type precedence rules and is therefore implicitly upcast to a BIT data type. Using this standard you'll never fall into the trap of upcasting all your BITS to TINYINTs. Note that if you run a SQL Server 2005 database in 80 compatability level you'll get the old SQL Server 2000 behaviour.

-- SQL Server 2000 implicit conversion issue.
-- SQL Server 2005 OK.

select ProductID
from Production.Product
where FinishedGoodsFlag = 1;

-- SQL Server 2000 OK.
select ProductID
from Production.Product
where FinishedGoodsFlag = 0x1;

The implicit conversion is quite easy to spot in the query plan.

Convert SQL 200 vs 2005

Labels: ,

Thursday, 1 March 2007

KB Article Hidden Gems

posted by Stefan Delmarco

One habit that has paid off repeatedly has been to loyally scan through PSS's SQL Server 2005 KB Article RSS Feed on a daily basis. There isn't a huge amount of volume; some days as many as 5 new articles, most days none. It works very well as my SQL Server gossip column. The KB articles tell me what the latest issues are and what information is considered critical enough by PSS to warrant a KB article.

For example, yesterday the following KB article was published:

FIX: I/O requests that are generated by the checkpoint process may cause I/O bottlenecks if the I/O subsystem is not fast enough to sustain the IO requests in SQL Server 2005

If you take the time to scroll to the very end of the article, you'll find the More Information section. This is where the really good stuff is kept. This specific hotfix is not what you would expect. The hotfix adds an additional SQL Server command line option to throttle the I/O requests to the disk subsystem. I/O subsystem issues are common when dealing with SQL Server so knowing about this specific issue and the existence of the hotfix will give you some options when investigating I/O subsystem issues.

Labels:

Monday, 20 November 2006

The TABLESAMPLE Clause

posted by Stefan Delmarco

When I first same across TABLESAMPLE I had visions of some advanced index hints that allowed fine grained control over table statistics (distribution, ranges, density, etc.). Unfortunately not. As per Occam's Razor the simplest explanation is the correct one. It just... returns a sample from the specified table...?

The complete syntax for TABLESAMPLE is as follows (from BOL):

TABLESAMPLE

Given this syntax, the most obvious query to test drive TABLESAMPLE would be to return, say, 10 rows from the Sales.SalesOrderHeader table. After all, this syntax looks a lot like TOP:

select SalesOrderId, OrderDate, TotalDue
from Sales.SalesOrderHeader tablesample system (10 rows);

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------

(0 rows(s) affected)

How unexpected is that! Zero rows returned?! Huh? For the sake of adding to the confusion, change the query to return 1,000 rows as follows:

select SalesOrderId, OrderDate, TotalDue
from Sales.SalesOrderHeader tablesample system (1000 rows);

Then run the query a few times and observe how many rows get returned:

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------
44137 2001-09-01 00:00:00.000 3953.9884
...

(938 rows(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------
44407 2001-10-16 00:00:00.000 3729.364
...

(1263 rows(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------
44757 2001-12-01 00:00:00.000 49953.7086
...

(1085 rows(s) affected)

Each execution returns a different number of rows! Some of them just below 1,000 and others just above. Clearly there is a non-deterministic factor at play here. Without any data changing, re-running the identical query keeps giving different results. Digging into the BOL documentation shows that the TABLESAMPLE algorithm has a very strong statistical slant to it. It works as follows:

For the table affected by the TABLESAMPLE, SQL Server needs to determine a probability that a page of that table's data will be sampled. This probability is equal to the PERCENTAGE specified in the TABLESAMPLE. Alternatively, if a number of ROWS was specified, that number is divided by the total number of rows in the table to calculate the probability. In our case, the Sales.SalesOrderHeader table has 31,465 rows. We asked for 1,000 rows in the second query. Therefore the probability is calculated as 3.2%.

Once this probability has been calculated, SQL Server will evaluate it for every page of the table. The details of the algorithm aren't documented but it is along the lines of:

  • For each page, a random number between 0 and 1 is generated (e.g. RAND()).
  • If the random number is greater than the probability, SQL Server will pretend that the page for the table does not exist (i.e. it is not sampled / 'out-of-scope').
  • Otherwise, SQL Server considers all of the rows on the page sampled / 'in scope'.

This is very revealing as it explains the behaviour we saw with the first query. We asked for a sample of 10 rows. SQL Server therefore calculated the probability of a page being returned as:
10 / 31,465 = 0.032 %, a very small number. If we execute the query once, there is a chance SQL Server will not generate a random number that falls within this small probability. However, we can increase our chances of getting at least one page returned by executing the select statement a number of times:

declare @loopCount int;
set @loopCount = 0;

declare @rowcount int;
set @rowCount = 0;

-- Keep looping till we manage to sample at least one page.
while @rowCount = 0 begin
    select SalesOrderId, OrderDate, TotalDue
    from Sales.SalesOrderHeader tablesample system (10 rows);

    set @rowcount = @@rowcount;
    set @loopCount = @loopCount + 1;
end;

raiserror('Required %d loops to return %d rows', 0, 1, @loopCount, @rowCount);

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------

(0 rows(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------

(0 rows(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------

(0 rows(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------

(0 rows(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------
69998 2004-05-06 00:00:00.000 2649.8453
...

(46 rows(s) affected)

Required 5 loops to return 46 rows

Interestingly enough, when we get back some rows (i.e. we get a probability hit) we get 91 rows returned instead of 10. This reveals another facet of the TABLESAMPLE algorithm. It only goes as far as the probability calculation of a page for sampling data. Once it has decided that a page of the table is to be included in the sample, it will not limit the number of rows sampled from that page. Therefore you are always liable to get back the number of rows equivalent to the page size divided by the row width (assuming, of course, that the pages are densely packed). Running the above query repeatedly returns 40-45 rows, indicating that a page of the Sales.SalesOrderHeader table typically holds 40-45 rows per page.

A couple of additional points to note:

  • The TABLESAMPLE is applied before any WHERE clauses are evaluated.
  • This includes any JOIN conditions as well. Some misleading results can be obtained if two tables that are JOIN'd have a TABLESAMPLE applied to them. As the sampling of the tables are independent and uncorrelated, there is no guarantee that related rows from both tables will be included.
  • The TOP clause can be combined with TABLESAMPLE to cap the number of rows returned. We saw earlier how the TABLESAMPLE algorithm can return more rows than requested. However, there is still every chance that less rows will be returned than was specified in the TABLESAMPLE.

The REPEATABLE option of the TABLESAMPLE clause basically causes SQL Server to use the same random numbers when evaluating each of the table's pages again (SQL Server is probably using a linear congruential generator to obtain the random numbers with the same seed). The net effect is that the same pages will be sampled assuming no underlying data has been changed.

In terms of practical applications of the TABLESAMPLE algorithm, I really struggled to think of a situation where it would have been useful. At face value, all that TABLESAMPLE does is return a certain number or percentage of 'random-ish' rows from the specified table. Unless you have some highly-specialised requirement to get a 'flavour' for the data in the table, this new clause isn't that useful. It doesn't augment the TOP / ROWCOUNT clause at all and doesn't expose any customisable or extensible behaviour.

I strongly suspect that the TABLESAMPLE clause was added specifically for the optimiser's gathering of statistics. Anyone who has seen STATMAN scrolling past in a SQL Profiler trace will know that the update of statistics uses a number of undocumented TSQL operators and index hints. For example:

select SalesOrderId, OrderDate, TotalDue
from (
    select top 100 percent SalesOrderId, OrderDate, TotalDue
    from Sales.SalesOrderHeader with (readcommitted, sample 10e-2 percent)
    order by 1, 2, 3)
option (bypass optimizer_queue, maxdop 1);

I count at least 2 undocumented operators and 1 view ordering workaround being used in this query. TABLESAMPLE is definitely a much cleaner implementation for a query that appears to be performing the same function.

Who knows, maybe one day your boss will ask your team: "You know what, it would be really cool if we could return random pages from this table, according to a probability we specify, get back different results every time we execute it, and even return vastly different numbers of rows than what we asked for". You'll be the first one to raise your hand...

Labels: ,

Friday, 13 October 2006

The UNPIVOT Operator

posted by Stefan Delmarco

Completing the discussion on the PIVOT operator is its antonym, the UNPIVOT. However, I would argue that the UNPIVOT operator has been misnamed by Microsoft as it creates the mistaken impression that PIVOT -> UNPIVOT -> PIVOT gets you back to where you started. From BOL:

UNPIVOT performs the opposite operation to PIVOT by rotating columns of a table-valued expression into column values.

It fails to highlight that fact that PIVOT has aggregated the raw data in order to rotate the underlying rows into columns - you have to read to the very end of the Using PIVOT and UNPIVOT sample for the fine print in BOL. In contrast, the UNPIVOT operator does not perform any aggregation whatsoever. It just turns columns into rows. For each column in the table being UNPIVOT'd you get a row in the result set where the row values are ColumnName and ColumnValue. It therefore helps to think of transpose instead of UNPIVOT as this is more accurate description of the UNPIVOT operator's behaviour.

Nevertheless, let's have a look at the syntax to try and get a better feel for what UNPIVOT brings to the table. As a worked example, consider the following requirement: Write a query to produce a result set that contains a list of products (Production.Product) and their SellStartDate and SellEndDate. The SellStartDate and SellEndDate need to be listed in a single column with a second column specifying whether it is the SellStartDate or SellEndDate value being specified. Lastly, the result set needs to be sorted by this consolidated date column to produce a linear timeline of 'product events'.

In SQL Server 2000, you'd probably end up with a query that looks very much like the following:

select ProductId, [Name], 'SellStartDate' as DateType, SellStartDate as DateValue
from Production.Product
union all
select ProductId, [Name], 'SellEndDate' as DateType, SellEndDate as DateValue
from Production.Product
where SellEndDate is not null
order by DateValue;

ProductId Name DateType DateValue
----------- ----------- ----------- -----------
...
320 Chainring Bolts SellStartDate 1998-06-01 00:00:00.000
321 Chainring Nut SellStartDate 1998-06-01 00:00:00.000
322 Chainring SellStartDate 1998-06-01 00:00:00.000
323 Crown Race SellStartDate 1998-06-01 00:00:00.000
324 Chain Stays SellStartDate 1998-06-01 00:00:00.000
325 Decal 1 SellStartDate 1998-06-01 00:00:00.000
326 Decal 2 SellStartDate 1998-06-01 00:00:00.000
327 Down Tube SellStartDate 1998-06-01 00:00:00.000
328 Mountain End Caps SellStartDate 1998-06-01 00:00:00.000
329 Road End Caps SellStartDate 1998-06-01 00:00:00.000
330 Touring End Caps SellStartDate 1998-06-01 00:00:00.000
...

Using the UNPIVOT operator, in SQL Server 2005, the equivalent is as follows:

select ProductId, [Name], DateType, DateValue
from
    (select ProductID, [Name], SellStartDate, SellEndDate
    from Production.Product) as rawData
unpivot
    (DateValue for DateType in
       (SellStartDate, SellEndDate)) as transposed
order by DateValue;

The key to understanding the UNPIVOT syntax is to see how the DateValue and DateType columns are specified. When transposing the columns into rows, they effectively become name / value pairs. The name is the name of the column and value is the value of the column for the relative row. BOL calls it value_column and pivot_column but just think of them as name / value pairs.

The first portion of the query selects the raw data which we will be transposing. The UNPIVOT portion of the query requires the name of the columns to be used for the name and value pair columns (Value FOR Name) followed by the columns that will be transposed. Lastly, the outermost select allows you specify which transposed columns you want to return, so you can choose not to return the Name column if you don't want it.

As the SELECT...CASE statements were able to duplicate all the functionality of the PIVOT operator, SELECT...UNION ALL can perform all the same tricks as UNPIVOT. However, the advantage again is a more compact and expressive syntax.

One behaviour of UNPIVOT to watch out for is that it will not transpose column values that are NULL. NULL values will not appear as name / value pairs in the results. In the sample above I've compensated for this behaviour by including where SellEndDate is not null in the SQL Server 2000 equivalent query. Note that there is way to change this behaviour.

In terms of performance, the query optimiser highlights a significant difference between the SELECT...UNION ALL and UNPIVOT queries. The former requires a pass over the underlying table for each select (2 passes in our sample) whereas the UNPIVOT requires only a single pass. SQL Server is able to transpose the columns into rows for each column in a single pass over the underlying data. This makes the UNPIVOT query less expensive than the equivalent UNION ALL. The size of this penalty is proportional to the number of rows being processed. This fact along makes UNPIVOT a very useful tool to keep note of in your TSQL arsenal.

Labels: ,

Sunday, 1 October 2006

SPID Blocks Itself...?

posted by Stefan Delmarco

Came across this useful KB the other day: 906344, The blocked column in the sysprocesses table is populated for latch waits after you install SQL Server 2000 SP4. The technical detail may be a bit confusing. Essentially the SPID is doing a convoluted "WaitForSingleObject"-type block, waiting for a data page to be fetched into memory from disk. If you see a lot of this type of blocking you may want to check your other performance counters as you may have poor performing IO or buffer pool memory pressure.

Labels:

Saturday, 23 September 2006

Microsoft SQL Server Development Team Blogs

posted by Stefan Delmarco

The best source for SQL Server internals and advanced TSQL are the Microsot SQL Server development teams' blogs. Some are more active than others, but they're all worth subscribing to for that gem you don't want to miss. For example, the SCHEMABINDING tip is a real nugget.

Labels:

Friday, 22 September 2006

For Database Professionals

posted by Stefan Delmarco

Visual Studio 2005 Team Edition for Database Professionals CTP 5 - Beta, download here. An update patch for CTP 5 already available

Labels: ,

The PIVOT Operator

posted by Stefan Delmarco

SQL Server 2005 introduces the PIVOT operator to the TSQL stable. The intention of this operator is to make it easier to transform unique values in rows into columns and aggregate the resulting data. Anyone familiar with Excel pivot tables or Access cross-tab queries will immediately be able to visualise what pivot tables look like. Prior to SQL Server 2005, the only way to transform data, in the manner provided by PIVOT, was to use a number of SELECT...CASE statements. As we'll see, the PIVOT operator has not given us functionality that was not possible before. However, it has made pivoting data much easier to perform and far more readable.

In order to understand how to form a PIVOT query, it is important to be able to describe, in plain English, what the query is trying to accomplish. Personally I have found that thinking of the PIVOT result set in terms of a graph (i.e. a X- and Y- axis) makes it easier to formulate into TSQL.

As a worked example, consider the following requirement. You need to generate a report that will correlate the number of addresses that were changed per city for each year. In other words, formulate a result set that will have each city listed as a column (X-axis). For each city, count the number of times an address in that city was modified in a given year (Y-axis). In SQL Server 2000, you'd probably create some TSQL similar to the following:

select datepart(yyyy, modifiedDate) as [WhenUpdates],
    sum(case city when 'Monroe' then 1 else 0 end) as [Monroe],
    sum(case city when 'Redmond' then 1 else 0 end) as [Redmond],
    sum(case city when 'Snohomish' then 1 else 0 end) as [Snohomish],
    sum(case city when 'Issaquah' then 1 else 0 end) as [Issaquah]
from Person.Address
group by datepart(yyyy, modifiedDate)
order by 1;

WhenUpdates Monroe Redmond Snohomish Issaquah
----------- ----------- ----------- ----------- -----------
1996 1 0 0 0
1997 0 1 0 0
1998 2 2 1 1
1999 11 12 7 13
2000 0 3 2 0
2001 0 5 0 6
2002 1 9 0 13
2003 0 45 0 42
2004 1 44 0 45

A couple of things to note:

  • Even though there are many more cities in the Person.Address table than the ones specified, we must 'hardcode' the city names into the query. There is no way to dynamically create additional columns if a new city is encountered. That fact alone makes this query difficult to maintain. If another city is added to Person.Address, this query would have to be modified.
  • The Y-axis (WhenUpdated) column is dynamic, we do not require prior knowledge of the modifiedDate's content when formulating the query. When an address is modified in a new year (e.g. 2005), another row will automatically be added to this result set without any changes.

Now, for the PIVOT operator. There is good news and bad news. The good news is that the syntax is far more compact and readable. The bad news is that it overcomes none of the limitations mentioned above! However, as we'll see, the good news is still worthwhile...

A PIVOT query is split into a number of sections. First, you need to write the SELECT statement that provides you with the raw, non-pivoted data. In our example, the raw data would be the following query:

select datepart(yyyy, modifiedDate) as [WhenUpdated], AddressID, City
from Person.Address;

WhenUpdated AddressID City
----------- ----------- ------------------------------
...
2004 19 Bothell
1999 20 Bothell
1999 21 Bothell
2001 22 Portland
1999 23 Seattle
2001 24 Duluth
2002 25 Dallas
2001 26 San Francisco
...

WhenUpdated is the Y-axis, AddressID is a data point - we're going to count AddressIDs for a given (X,Y) / (City,WhenUpdated) combination - and City is the X-axis. Once we have the raw data, it is simple case of articulating which column is used for what function in the pivot:

select Pvt.*
from (
    -- The raw data as a derived table...
    select datepart(yyyy, modifiedDate) as [WhenUpdated], AddressID, City
    from Person.Address) as RawData
pivot (
    count(AddressID)  -- How each X,Y point is calculated...
    for City          -- The X-axis, stated explicitly...
    in ([Monroe], [Redmond], [Snohomish], [Issaquah])) as Pvt
order by 1;

WhenUpdated Monroe Redmond Snohomish Issaquah
----------- ----------- ----------- ----------- -----------
1996 1 0 0 0
1997 0 1 0 0
1998 2 2 1 1
1999 11 12 7 13
2000 0 3 2 0
2001 0 5 0 6
2002 1 9 0 13
2003 0 45 0 42
2004 1 44 0 45

At first glance the PIVOT section seems a bit unintuitive. However, if you work through the syntax it will start to make sense. The PIVOT operator needs to know how to calculate each (X,Y) data point's value, specified as an aggregate function. The FOR column contains the X-axis values and the IN contains the hard-coded list of X-axis columns. As you can see, there is no getting around the SELECT...CASE limitation of having to know the transformed columns before-hand. In case you are wondering: no, IN cannot take a wildcard, e.g. IN(*). I believe that the limitation exists due to the optimizer's requirement to have fixed metadata when the execution plan is generated. This would not be possible if the column metadata is dynamically generated from the data itself.

The Y-axis is inferred as the distinct set of all columns that have not been used by the aggregate function or by the X-Axis. In our case, that would be the WhenUpdated computed column.

In terms of execution plans, the SELECT...CASE and PIVOT produce almost identical plans. However, as we're throwing away more of the rows in the Person.Address table (only 267 of the 19,614 rows are for the cities in our X-axis), both queries would be far more efficient if they included the set of X-axis column values (i.e. the names of the cities) in their WHERE clauses.

Instead of trying to dream up dynamic-SQL ways of working around this problem (non-dynamic columns and rows falling outside the predefined X-axis values) I would argue that you're better off aggregating the data in TSQL (using 'traditional' SELECT...GROUP BY statements) and transforming it in your middle tier / client application where there are better data structures that can be used to represent extremely wide or sparse result sets.

A better pattern to follow is to realise that the X-axis can be used to specify fixed ranges instead of raw values. Using this technique you can select ranges that will guarantee that you have covered off all possible X-axis values. For example, instead of allowing the raw unique values of a column to destabilise the PIVOT, the following query maps the values in the TotalDue column in a number of well-known buckets and then uses those values as the X-axis. Using this technique there is no possibility of having rows fall outside the X-axis range:

select Pvt.*
from (
    select datepart(yyyy, OrderDate) as [OrderDate],
        case
            when TotalDue < 100 then 'NoMargin'
            when TotalDue >= 100 and TotalDue > 1000 then 'OnTarget'
            else 'ExtraBonus' end as [Profit],
        PurchaseOrderId
    from Purchasing.PurchaseOrderHeader) as RawData
pivot (
    count(PurchaseOrderId)
    for Profit
    in([NoMargin], [OnTarget], [ExtraBonus])) as Pvt
order by 1;

OrderDate NoMargin OnTarget ExtraBonus
----------- ----------- ----------- -----------
2001 0 4 4
2002 7 105 160
2003 26 399 607
2004 68 1070 1550

As we've seen, the PIVOT operator does not provide any transformational capabilities that weren't possible in SQL Server 2000 (with liberal use of SELECT...CASE). However, as any experienced developer will know, the easier it is to express a reasonably complex SET-based problem, the better chance you have of getting it right, first time. Even more importantly, the poor developer required to maintain it (after you've moved on to greater conquests) will have a better chance of getting it right for the second time!

Labels: ,

Saturday, 26 August 2006

The APPLY Operator

posted by Stefan Delmarco

We've come across the APPLY operator before when we were exploring DMVs. The APPLY operator has been added to the TSQL repertoire in SQL Server 2005. It fills an important gap with User-Defined Functions (UDFs).

UDFs are less well-known than stored procedures (SPs) and are often overlooked by inexperienced SQL developers. BOL has a good write-up on how to choose between SPs and UDFs. For the context of this article, a brief overview of UDFs:

UDFs come in two flavours: those that returns scalars and those that return tables. Scalar-valued UDFs are very straightforward. They take in a bunch of parameters and return a single scalar value. These are functions in the traditional VB / C# / C++ sense. Table-valued functions are a little different. They also can take in a bunch of parameters. However, they are able to return a table instead of a scalar value. This returned table is referenced in TSQL in the same manner as a table or a view. This makes them look like a cross between functions and views. In fact, there are two types of table-valued UDFs. Those that contain a single SELECT statement (they look very much like a derived table) called 'Inline Table-Valued Functions' and those that build-up the return table using procedural logic (they look like stored procedures) called 'Multistatement Table-Values Functions'. It is the latter-type that we are interested in.

Multistatement table-valued functions are very powerful as they allow us to encapsulate complex procedural logic in a function when that logic cannot be implemented in a database view (i.e. you cannot express that desired output using a single set-based TSQL expression). You are free to construct the output rows in what ever manner you wish! The only restriction is that you need to declare the schema of the table that the UDF returns upfront as part of the UDF's declaration.

As an example, consider the following table-valued UDF:

create function dbo.udf_DecomposeDns (
    @webServiceUrl nvarchar(1024))
returns @dnsDomains table (
    domain nvarchar(1024))
as begin
    -- Return no rows if NULL...
    if @webServiceUrl is NULL
        return;

    declare @dns nvarchar(1024);

    -- Remove the contents after the DNS...
    declare @slashIndex int;
    set @slashIndex = charindex(N'/', @webServiceUrl);
    if @slashIndex > 0
        set @dns = left(@webServiceUrl, @slashIndex - 1);
    else
        set @dns = @webServiceUrl;

    -- Decompose the DNS into its hierarchical parts...
    declare @dotIndex int;
    set @dotIndex = charindex(N'.', @dns);
    insert @dnsDomains (domain) values (@dns);

    while @dotIndex > 0 begin
        set @dns = right(@dns, len(@dns) - @dotIndex);
        insert @dnsDomains (domain) values (@dns);
        set @dotIndex = charindex(N'.', @dns);
    end;
    return;
end;

This UDF will decompose a URL into its hierarchical DNS domain constituents. For the sake of brevity the URL is assumed to consist of a DNS domain name followed by a URL-path (as in AdventureWorks' Purchasing.Vendor PurchasingWebServiceURL column). For example:

select *
from dbo.udf_DecomposeDns(N'www.extranet.fotia.com/order.asmx')

domain
-------------------------
www.extranet.fotia.co.uk
extranet.fotia.co.uk
fotia.co.uk
co.uk
uk

So far, all this is well within the capabilities of SQL Server 2000. Now, consider the following requirement: Generate a report that will contain the DNS composition of each Vendor's PurchasingWebServiceURL in AdventureWorks' Purchasing.Vendor table.

To achieve this we want to decompose each URL into its domain hierarchy. For each vendor we would expect multiple rows to be returned, equivalent to the number of parts in that Vendor's URL. This would not be possible to do in a single SELECT statement in SQL Server 2000. The SELECT syntax did not have an operator / syntax that would allow a table-valued UDF to be re-evaluated repeatedly for every row it was being joined to. We could only execute a table-valued UDF once in a SELECT statement. We could pass a parameter to the UDF but the parameter value could not be from any column in the resulting set. It could only be a constant value or a declared variable.

This is the gap that the APPLY operator fills in SQL Server 2005. It allows us to get the table-valued UDF to be executed for each row it is being joined to. In this manner the parameters to a table-valued UDF should be seen as the join criteria.

The APPLY operator can be used in two forms: CROSS APPLY and OUTER APPLY. To understand the difference think of the table-valued UDF as a normal table you join to. You'd either use an INNER or an OUTER join. CROSS APPLY is an INNER JOIN, and OUTER APPLY is an OUTER JOIN. If the UDF returns an empty result set for a specific row, that row will be excluded if CROSS APPLY is used. However, if OUTER APPLY is used that row will still be included. NULL values will be substituted for the missing UDF row (identical to a traditional OUTER JOIN). It's that simple.

CROSS APPLY over the Purchasing.Vendor table produces the following result:

select [Name], PurchasingWebServiceURL, domain
from Purchasing.Vendor
    cross apply dbo.udf_DecomposeDns(PurchasingWebServiceURL);

Name PurchasingWebServiceURL domain
------------------- ------------------------------ --------------------------
A. Datum Corporation www.adatum.com/ www.adatum.com
A. Datum Corporation www.adatum.com/ adatum.com
A. Datum Corporation www.adatum.com/ com
Trey Research www.treyresearch.net/ www.treyresearch.net
Trey Research www.treyresearch.net/ treyresearch.net
Trey Research www.treyresearch.net/ net
Wide World Importers www.wideworldimporters.com/ www.wideworldimporters.com
...

Compare the output with OUTER APPLY. Note the NULL rows substituted for the missing rows from the UDF:

select [Name], PurchasingWebServiceURL, domain
from Purchasing.Vendor
    outer apply dbo.udf_DecomposeDns(PurchasingWebServiceURL);

Name PurchasingWebServiceURL domain
---------------------------------- ------------------------ --------------------
International NULL NULL
Electronic Bike Repair & Supplies NULL NULL
Premier Sport, Inc. NULL NULL
Comfort Road Bicycles NULL NULL
Metro Sport Equipment NULL NULL
Green Lake Bike Company NULL NULL
Mountain Works NULL NULL
Continental Pro Cycles NULL NULL
A. Datum Corporation www.adatum.com/ www.adatum.com
A. Datum Corporation www.adatum.com/ adatum.com
A. Datum Corporation www.adatum.com/ com
Trey Research www.treyresearch.net/ www.treyresearch.net
Trey Research www.treyresearch.net/ treyresearch.net
Trey Research www.treyresearch.net/ net
Anderson's Custom Bikes NULL NULL
Compete, Inc. NULL NULL
...

We are able to bring the full might of TSQL to bear on table-valued UDFs. For example, aggregations can be run on the result set:

select domain, count(*) as [CountOf]
from Purchasing.Vendor
    outer apply dbo.udf_DecomposeDns(PurchasingWebServiceURL)
group by domain
order by 2 desc;

domain CountOf
----------------------------- -----------
NULL 98
com 5
litwareinc.com 1
net 1
northwindtraders.com 1
proseware.com 1
treyresearch.net 1
...

The APPLY operator is indeed a valuable addition to the TSQL. It allows a whole new modular and procedural approach to composing result sets.

Labels: ,

Monday, 7 August 2006

Introducing Common Table Expressions (CTEs)

posted by Stefan Delmarco

The concept of CTEs has been around in SQL Server for a long time. Anyone who's used derived tables before will immediately recognise and understand CTEs. Think of CTEs as formalised sub-queries or temporary views that can be defined and referenced in a single TSQL statement.

I've labelled this article as an introduction as CTEs bring a lot to the TSQL table, which we'll explore in much greater detail in future articles.

Consider the following requirement: you need a result-set that returns the sum of all products sold per product. For each product, you'll need to return a bunch of metadata about that product. In addition, your boss (for reasons only known to him) want these results only for products that are either red in colour or have more than 10 transactions! In SQL Server 2000 you might start writing something similar to the following:

select [Name], Color, sum(Quantity) as [SumQuantity]
from Production.Product P
    inner join Production.TransactionHistory T
        on P.ProductId = T.ProductId
where P.Color = 'red'
    or (    select count(*)
            from Production.TransactionHistory THinner
            where THinner.ProductId = P.ProductID) >= 10
group by [Name], Color;

Whilst this query gets the job done, it does look a bit messy as we're having to perform a correlated sub-query for each ProductID, evaluate it as a scalar and then determine whether or not we're going to calculate the sum of all quantities of all transactions for the associated ProductID. In addition, this TSQL is misrepresenting its purpose in the GROUP BY clause. The requirement stated that the summation needs to be performed per ProductID. We probably got lucky here as Name is unique per Product. If it wasn't we could get some unpredictable results (e.g. if there were two Pink FooBar's that differed only by some other metadata). Any column in the SELECT list that is not an aggregate must be included in the GROUP BY. It is the mix of aggregate and non-aggregate data that has lead to this mistake. A correct way to write this query would be to use a derived table:

select P.[Name], P.Color, PA.SumQuantity
from Production.Product P
    inner join (
            select ProductID,
                count(*) as [CountOfTransactions],
                sum(Quantity) as [SumQuantity]
            from Production.TransactionHistory
            group by ProductID) as PA
        on P.ProductId = PA.ProductID
where PA.[CountOfTransactions] >= 10
 or P.Color = 'red';

This is a much better as we have separated out the aggregate from the additional metadata portion of the query. The inner select statement is a derived table that is uncorrelated with the outer select so that it can execute independently. We have given this derived table the PA alias. Whilst this has given the TSQL much better structure, it does have some limitations. If we wanted to refer to the derived table again we would have to copy-paste it again into the select. For example, if we now wanted to include number of transactions the modulo 10 of the ProductID contained (simulating an arbitrary Product-to-Product relationship), we'd have to copy-and-paste the whole derived table again:

select P.[Name], P.Color, PA1.SumQuantity, PA2.ProductID, PA2.SumQuantity
from Production.Product P
    inner join (
            select ProductID,
                count(*) as [CountOfTransactions],
                sum(Quantity) as [SumQuantity]
            from Production.TransactionHistory
            group by ProductID) as PA1
        on P.ProductId = PA1.ProductID
    inner join (
            select ProductID,
                count(*) as [CountOfTransactions],
                sum(Quantity) as [SumQuantity]
            from Production.TransactionHistory
            group by ProductID) as PA2
        on P.ProductId % 10 = PA2.ProductId
where PA1.[CountOfTransactions] >= 10
 or P.Color = 'red';

Now to introduce CTEs: Rewriting the above query to use a CTE will look as follows,

with ProductAggregate(ProductId, CountOfTransactions, SumQuantity)
as (
    select ProductID, count(*), sum(Quantity)
    from Production.TransactionHistory
    group by ProductID)
select P.[Name], P.Color, PA1.SumQuantity, PA2.ProductID, PA2.SumQuantity
from Production.Product P
    inner join ProductAggregate PA1
        on P.ProductId = PA1.ProductID
    inner join ProductAggregate PA2
        on P.ProductId % 10 = PA2.ProductId
where PA1.CountOfTransactions >= 10
or P.Color = 'red';

As a developer, think of a CTE as a local variable declaration, having scope of a single TSQL statement. The WITH statement is the declaration that defines the columns (schema) of the CTE. The AS portion defines the implementation of the CTE. Once declared, the CTE exists as any normal table would in the TSQL statement. As demonstrated above, the CTE can be reused as many times are required in the TSQL statement. In the example above we're using it twice (with different aliases) yet we don't have to re-declare it (as we were required to with derived tables). CTE re-use extends to UNIONs as well. For example:

with ProductAggregate(ProductId, CountOfTransactions, SumQuantity)
as (
    select ProductID, count(*), sum(Quantity)
    from Production.TransactionHistory
    group by ProductID)

select P.[Name], P.Color, PA.SumQuantity
from Production.Product P
    inner join ProductAggregate PA
        on P.ProductId = PA.ProductID
where PA.CountOfTransactions >= 10

union all

select P.[Name], P.Color, PA.SumQuantity
from Production.Product P
    inner join ProductAggregate PA
        on P.ProductId = PA.ProductID
where P.Color = 'red';

A couple of other points to note about CTEs:

  • A CTE can only be used in the TSQL statement it is declared in. If you find yourself repeating a CTE a number of times then you should consider 'upgrading' it to a database view in its own right (assuming you're not using a CTE-only feature - mentioned below).
  • CTE's can be used in SELECT / UPDATE / DELETE / INSERT statements, anywhere a table can be used.
  • The previous statement of any TSQL statement that uses a CTE must be terminated with a semi-colon. This can be a bit infuriating if you're not used to using semi-colons in your TSQL but try to get into the habit of terminating all your TSQL statements with semi-colons (this works even in SQL Server 2000).

Anyone that's looked at CTEs before will have noted that I have not yet mentioned the one feature that everyone immediately highlights when discussing CTE's, i.e. CTE's can be used recursively to join on themselves. In my opinion, the recursive properties of CTEs will only be used in the very few queries that need to unroll / traverse hierarchies. The ability to write modular set-based queries is the primary reason we should all start using CTEs. They make queries easier to maintain and allow complex queries to be broken down into their logical components without sacrificing the power of set-based queries (as opposed to poorly performing but easier to read procedural TSQL).

Labels: ,

Monday, 24 July 2006

VARCHAR(MAX)

posted by Stefan Delmarco

VARCHAR(MAX), NVARCHAR(MAX) and VARBINARY(MAX) are the SQL Server 2005 replacements for TEXT, NTEXT and IMAGE large. Not only are the replacements far better named than their SQL Server 2000 counterparts, they are also far easier to use. However, before we start rejoicing in their virtues, a word of caution. Just because you can now assign a 2GB string to a local variable doesn't mean you should! The ease with which large-value data types can be accessed and manipulated means that we have a greater need to treat the data responsibly and be aware that we could be handling a large amount of data in a local variable. It is now more important to be aware that you could be moving mountains of data in some very innocent looking TSQL. We must make sure we don't use these powerful capabilities inappropriately. "With great power comes great responsibility" - Spiderman!

In BOL you'll notice that a distinction is drawn between large value types and large object data types. The new MAX types are part of the large value types and the TEXT / NTEXT / IMAGE family are part of the large object types. The only distinction I can find between the two that warrants this separation is the use of the infamous text pointer that was used with large object types. So when you come across large value versus large object types in BOL, this is what they mean.

TEXT, NTEXT and IMAGE are notoriously difficult to use within the database (i.e. within TSQL or stored procedures). It is not possible to use TEXT, NTEXT and IMAGE data types as local variables. SQL Server allows us to use a level of indirection through a 'text pointer'. Essentially this is a handle to a TEXT / NTEXT / IMAGE column of a specific row. In order to access the data in the column we first have to:

  • read the text pointer of the column of the row we were interested in (using the TEXTPTR function), then
  • use the text pointer as a parameter to the READTEXT, WRITETEXT or UPDATETEXT functions.

In addition, we are responsible for controlling the locking / isolation of the text pointer from changes while we are reading it through the use of HOLDLOCK locking hints and within transactions. This prevents anyone from changing the data while we are 'dereferencing' the text pointer. Alternatively you can just SELECT the data into VARCHAR / NVARCHAR / VARBINARY local variables and risk truncating the data. You can also SELECT the column INTO another table that has TEXT / NTEXT / IMAGE columns.

Whilst this sounds like many hoops to jump through, it is actually beneficial as developers resort to handling TEXT / NTEXT / IMAGE data outside the database. Managing these data types from ADO.NET / OLE DB is far easier than trying to manipulate the data within the database. The data can be extracted in plain SELECT statements (instead of READTEXT) and be streamed back to the client in chunks. In addition, this is a far more scalable alternative as it doesn't consume precious memory resources on the SQL Server.

New to SQL Server are the 'MAX' versions of VARCHAR / NVARCHAR / VARBINARY. There are the replacements for TEXT / NTEXT / IMAGE that have now been marked as deprecated and will be removed in a future version of SQL Server. These 'MAX' versions are meant to address the in-database manipulation difficulties of their predecessors. They are also designed to be able to handle small to very large data sizes efficiently without having to choose between VARCHAR / NVARCHAR / VARBINARY and TEXT / NT