Fotia Ltd

Fotia Ltd

Friday, 1 February 2008

Partitioned Views on a Budget

posted by Stefan Delmarco

Partitioned views are a great feature of SQL Server. By injecting an additional layer of abstraction between your logical view and physical implementation they give you fine grained control over data partitioning. However, the cover charge required for entry to this exclusive club is pretty high. The checklist required to implement a partitioned view can often interfere with your design to such a degree that they become more of a hindrance than a benefit. However, by decomposing the features SQL Server uses to implement the complete partitioned views specification we can get most of the benefits without meeting all of the criteria.

One of the benefits of partitioned views is the query optimizer's ability to determine which subset of the member tables need to be queried for each access to the partitioned view. The SQL Server query optimizer has clever features built into it that are able to understand variables and constants in TSQL and relate those constants to search predicates. The optimizer is able to use that insight to short-circuit query plan execution and not bother with branches that will never execute. For example:

declare @constant int;
set @constant = 0;

select *
from Person.Contact
where 1 = @constant;

The query plan for this bit of TSQL looks like this:

The SQL Server query optimizer has introduced a filter into the query plan that has a special property called a Startup Expression Predicate. This property is the short-circuit logic that determines whether or not the clustered index scan on the Person.Contact table executes or not. If the startup expression evaluates to FALSE then the optimizer doesn't bother executing any portion of that query. In this example, SQL Server will do nothing when @constant = 1 evaluates to FALSE. It is the exact same technology that partitioned views use to determine which underlying member tables to access based on the member tables' check constraints / partitioning columns.

For example, consider a situation where I would like to query a table that exists in a number of databases. I would like to have a partitioned view that aggregates these member tables in a single view. However, I don't want to pay the penalty of having to query every member table for every query. If I specify the database name in the SELECT query, I want the optimizer to use its short-circuit feature to go directly to the one table that will satisfy the search arguments. The solution is easier than you think!

create database Database01;
create database Database02;
create database Database10;
create database Database11;
go

use Database01
create table MemberTable (
   Col1 varchar(10));
go

use Database02
create table MemberTable (
   Col1 varchar(10));
go

use Database10
create table MemberTable (
   Col1 varchar(10));
go


use Database11
create table MemberTable (
   Col1 varchar(10));
go

insert Database01.dbo.MemberTable values ('Data01');
insert Database02.dbo.MemberTable values ('Data02');
insert Database10.dbo.MemberTable values ('Data10');
insert Database11.dbo.MemberTable values ('Data11');
go

use tempdb;
go
create view BudgetView as
select 'Database01' as SourceDatabase, Col1
from Database01.dbo.MemberTable
union all
select 'Database02', Col1
from Database02.dbo.MemberTable
union all
select 'Database10', Col1
from Database10.dbo.MemberTable
union all
select
'Database11', Col1
from Database11.dbo.MemberTable
go

select *
from BudgetView;

SourceDatabase Col1
-------------- ----------
Database01 Data01
Database02 Data02
Database10 Data10
Database11 Data11

(4 rows(s) affected)

The query plan for a SELECT of all of the rows shows SQL accessing each member table in turn:

This is the clever bit. We've put the partitioning criteria that separates each member table into the view's definition. We've hardcoded the database name of each member table into the first column of the view. The really clever bit is that SQL Server understands this! Try the following:

select *
from BudgetView
where SourceDatabase = 'Database01';

SourceDatabase Col1
-------------- ----------
Database01 Data01

(1 row(s) affected)

SQL Server has looked at the search argument we specified for SourceDatabase and short-circuited the query plan. It only queried Database01's MemberTable and didn't bother accessing any of the others!

It gets even better, SQL Server understands all manner of comparisons, including LIKEs, in this example it only access databases where the name starts with Database0.

select *
from BudgetView
where SourceDatabase like 'Database0%';

SourceDatabase Col1
-------------- ----------
Database01 Data01
Database02 Data02

(2 row(s) affected)

The story with parameters is also good! The startup filters we saw earlier make an appearance:

declare @sourceDatabase varchar(20);
set @sourceDatabase = 'Database10';

select *
from dbo.BudgetView
where SourceDatabase = @SourceDatabase;
SourceDatabase Col1
-------------- ----------
Database10 Data10

(1 row(s) affected)

At execution time SQL Server will dynamically determine which branches of the query plan to short-circuit. This is great as it allows a single query plan to be re-used for any value of @sourceDatabase and still deliver the short-circuits benefits.

We've managed to create a view that has the same short-circuit benefits as a pukka partitioned view without having to adhere to all the partitioned view requirements. Our base tables don't have primary keys defined and there are no check constraints. On the downside our view is not updateable, it cannot be inserted into, updated or deleted. If you need all of these features you can either implement INSTEAD OF triggers or go with fully-fledged partitioned views. In this example we used the partitioning column to identify the source database name. Constant partitioning columns are also useful for soft-delete / archive tables, coarse categories, or chunky time periods.

Labels: ,

Monday, 23 April 2007

The NEWSEQUENTIALID Function

posted by Stefan Delmarco

The NEWSEQUENTIALID system function is an addition to SQL Server 2005. It seeks to bring together, what used to be, conflicting requirements in SQL Server 2000; namely identity-level insert performance and globally unique values.

Insert Performance and Uniqueness

The best insert performance will always be achieved by using an identity column as the clustering key. With a monotonically increasing identity column as the clustered index, data is always inserted at the end of the table. At a basic level this is equivalent to the performance of a table that has no clustered index (i.e. a heap) where the insert position for the next row is the page where the last row was inserted. Secondly, there will be no page splits. With a normal clustered index, SQL Server will move half the rows on a full page to a newly allocated page in order to make space for the new row. With an identity column on a clustered index SQL Server will not 'split' the page in this manner. It will just allocate another page and continue inserting rows. It has enough smarts to know not to copy half the rows to the newly allocated page as all new rows will be inserted into the new page.

The limitations of identity columns come to the fore when you need have unique keys that span either tables, databases or SQL Servers. For example, say you have multiple branches that can each generate orders. Each branch has a replicated copy of the Orders table, however you don't want duplicate order numbers generated across branches. Common solutions to this problem include creating a composite order number that includes the branch id, reserving ranges of the identity column or making use of a central 'number fountain'. Each of these approaches introduce a level of indirection that must be adhered to before a row can be created. If the order number is not required to be human-readable friendly, GUIDs are often used instead.

GUIDs

GUIDs (Globally Unique Identifiers) are pseudo-random 16-byte (128-bit) numbers whose generation algorithm provides a sufficiently high degree of probability that the same GUID will never be generated twice on any computer at any time. Although a GUID is a binary array, it is often represented in its hexadecimal form using the following format, for example: dbbc2827-edf8-4a2d-92ad-c1e0059304d7. This makes them suitable for the 'distributed uniqueness requirement' of our Orders table.

The problem would be pretty much solved there unless we have some non-trivial performance requirements, especially around insert volumes and speed. As GUIDs are effectively pseudo-random (the degree of randomness / predictability was changed by Microsoft in response to privacy concerns) they do not possess the same orderly insert benefits as identity columns when used as the clustering key. The insert IO pattern changes from sequential to random-access when using GUIDs instead of an IDENTITY column. Not only do GUIDs require far more IO to locate the insert location in the clustered index (the index has to be traversed from the root for every insert), fragmentation at the leaf level leads to less dense pages, which in turn requires defragmentation / reindexing.

NEWSEQUENTIALID()

Enter NEWSEQUENTIALID(), stage left. This is a new system function included in SQL Server 2005 that combines the best of both worlds. NEWSEQUENTIALID() will generate a GUID that will be greater in value than the previously generated one.

It is easy to see how this function works by looking at the output the following TSQL script generates:

create table TestTable (
   id uniqueidentifier default newsequentialid() not null primary key clustered,
   sequence int);
go

-- Insert 100 rows.
declare @count int;
set @count = 0;
while @count < 100 begin
   insert TestTable (sequence)
   values (@count);

   set @count = @count + 1;
end;
go

select *
from TestTable
order by id;
go

id sequence
------------------------------------ -----------
FA780E3B-03C2-DB11-BD9F-0011D82F3F23 0
FB780E3B-03C2-DB11-BD9F-0011D82F3F23 1
FC780E3B-03C2-DB11-BD9F-0011D82F3F23 2
FD780E3B-03C2-DB11-BD9F-0011D82F3F23 3
FE780E3B-03C2-DB11-BD9F-0011D82F3F23 4
...  
5C790E3B-03C2-DB11-BD9F-0011D82F3F23 98
5D790E3B-03C2-DB11-BD9F-0011D82F3F23 99

You'll notice that there is a clear bit-shifting pattern that SQL Server is employing to keep the GUIDs increasing in relative value. These are the restrictions in NEWSEQUENTIALID's use:

  • NEWSEQUENTIALID cannot be used in arbitrary TSQL statements. It can only be specified as the DEFAULT value for an uniqueidentifier column
  • More than one column in a table can use NEWSEQUENTIALID
  • It cannot be combined with other scalar functions e.g. REVERSE(NEWSEQUENTIALID()) is not allowed

The network card's MAC address is easily identifiable in the generated GUID. My laptop's MAC address is

   00-01-4A-28-64-8B

This is one of the NEWSEQUENTIALID GUIDs that was generated:

   4EAC7708-30C3-DB11-B902-00014A28648B

Hence the privacy warnings in BOL:

NEWSEQUENTIALID() Privacy Warning

Correlating the position of the MAC address in the GUID with the bits that get incremented / shifted for each successive GUID sheds light on why the GUIDS remain globally unique. The MAC address portion of the GUID does not change.

Performance Comparison

The following test demonstrates the benefits of NEWSEQUENTIALID. The insert performance of the various clustering key variations was measured and compared. For each test the same TestTable was recreated with a different Id column definition:

  • IDENTITY(,)
  • NEWID()
  • NEWSEQUENTIALID()

The following TSQL contains the DDL used to construct the tables for the 3 clustered keys:

-- IDENTITY
create table TestTable (
   id int identity(1,1) not null primary key clustered,
   sequence int not null,
   data char(250) not null default '');
go

-- NEWID
create table TestTable (
   id uniqueidentifier default newid() not null primary key clustered,
   sequence int not null,
   data char(250) not null default '');
go

-- NEWSEQUENTIALID
create table TestTable (
   id uniqueidentifier default newsequentialid() not null primary key clustered,
   sequence int not null,
   data char(250) not null default '');
go

For each test 50,000 rows were inserted into TestTable table using the following TSQL script:

-- Insert 50,000 rows.
declare @count int;
set @count = 0;
while @count < 50000 begin
   insert TestTable (sequence)
   values (@count);

   set @count = @count + 1;
end;
go

The following metrics were gathered after each run:

-- Get the number of read / writes for this session...
select reads, writes
from sys.dm_exec_sessions
where session_id = @@spid;

-- Get the page fragmentation and density at the leaf level.
select index_type_desc, index_depth, page_count, avg_page_space_used_in_percent, avg_fragmentation_in_percent, record_count
from sys.dm_db_index_physical_stats(db_id(), object_id('TestTable'), null, null, 'detailed')
where index_level = 0;
go

The results are quite compelling:

  Reads Writes Leaf Pages Avg Page Used Avg Fragmentation Record Count
IDENTITY(,) 0 1,683 1,667 98.9% 0.7% 50,000
NEWID() 0 5,386 2,486 69.3% 99.2% 50,000
NEWSEQUENTIALID() 0 1,746 1,725 99.9% 1.0% 50,000

Most striking is the number of writes required by the NEWID system function. This, coupled with the average page density of 69%, is evidence of the page splitting caused by the random distribution of inserts at the leaf level. As soon as a page fills up, it needs to be split into 2 pages of 50% each for the insert to complete. Not only has page splitting resulted in poor page density, it has fragmented the data pages quite badly (there is a 99% probability that the next data page is not next to the current one). In our tests the most likely place for a free page required for the page split is at the end of the table irrespective of where the row is being inserted. Therefore to read the rows in order the scan needs to keep jumping back and forth between widely distributed split pages, hence the appalling fragmentation.

The minor difference in page count between IDENTITY and NEWSEQUENTIALID metrics is due to the difference in size between IDENTITY's INT (4 bytes) and NEWSEQUENTIALID's UNQUEIDENTIFIER (16 bytes). However, note that any non-clustered index on the UNIQUEIDENTIFIER table will be 4 times larger than the same index on the IDENTITY table. Therefore IDENTITY is still preferred over NEWSEQUENTIALID when choosing between the two.

The NEWSEQUENTIALID system function has clearly lived up to its claim of providing GUID-like uniqueness coupled with identity-level insert performance. The number of writes, fragmentation and page density are all inline with identity-level metrics. These benefits make NEWSEQUENTIALID a compelling feature.

Labels: ,

Monday, 12 March 2007

When a String Is Not a String

posted by Stefan Delmarco

This article was inspired by a seemingly innocent bit of code that caused a performance problem in a very unexpected way. It is the kind of code that you'd find in many samples. Unfortunately, it has a potential flaw that is only exposed when large data volumes are present.

The Problem Code

The Data Access Layer (DAL) code in question is as follows:

public static class AccountDal
{
   public static string GetTerritoryForAccount(string accountNumber)
   {
      using(SqlConnection con = new SqlConnection(@"Server=localhost;Database=AdventureWorks;
         Integrated Security=SSPI"
))
      using(SqlCommand cmd = con.CreateCommand())
      {
         cmd.CommandText = @"
            select t.Name
            from Sales.Customer c
               inner join Sales.SalesTerritory t
                  on c.TerritoryID = t.TerritoryId
            where c.AccountNumber = @AccountNumber"
;
         cmd.CommandType = CommandType.Text;
         cmd.Parameters.AddWithValue("@AccountNumber", accountNumber);

         con.Open();
         return cmd.ExecuteScalar() as string;
      }
   }
}

Very simple stuff. Open a connection to SQL, select a specific row from the Sales.Customer table, join the resultant row to the Sales.SalesTerritory table and return the name of the territory. The code is making diligent use of parameters for reusable query plans, strong typing and to prevent SQL injection attacks. The following test exercises the code:

[TestClass]
public class AccountDalTests
{
   [TestMethod]
   public void CanGetTerritory()
   {
      string territory = AccountDal.GetTerritoryForAccount("AW00000010");

      Assert.AreEqual(territory, "Canada");
   }
}

This test passes, so functionally the code works as expected. Running this bit of code on a decently spec'd developer workstation takes about 800 ms. Running it repeatedly doesn't improve on this figure. The Sales.Customer table contains 19,185 rows. Given that we're looking up a single row in the Sales.Customer row the alarm bells should be ringing, especially as the Sales.Customer has a nonclustered, unique index on AccountNumber:

exec sp_helpindex 'Sales.Customer';

index_name index_description index_key
-------------------------- ----------------------------------------------------- --------------------
AK_Customer_AccountNumber nonclustered, unique located on PRIMARY AccountNumber
AK_Customer_rowguid nonclustered, unique located on PRIMARY rowguid
IX_Customer_TerritoryID nonclustered located on PRIMARY TerritoryID
PK_Customer_CustomerID clustered, unique, primary key located on PRIMARY CustomerID

Diagnosing the Issue

Capturing the TSQL being executed in SQL Profiler and running it through Query Analyzer shows the following query plan. The query plan shows exactly where all the time is being spent: Original Query Plan The IX_Customer_TerritoryID index is being used. However, we're performing an index scan of all the AccountNumbers instead of just an index seek to the specific row we're returning. The compute scalar step following the index scan shows precisely why. Expr1004 is the output of an implicit conversion of every AccountNumber to NVARCHAR(10). Looking at the definition of the Sales.Customer table shows that AccountNumber is, in fact, a VARCHAR(10):

select column_name, data_type, character_maximum_length
from information_schema.columns
where table_catalog = 'AdventureWorks'
 and table_schema = 'Sales'
 and table_name = 'Customer';

column_name data_type character_maximum_length
-------------- ------------------- ------------------------
CustomerID int NULL
TerritoryID int NULL
AccountNumber varchar 10
CustomerType nchar 1
rowguid uniqueidentifier NULL
ModifiedDate datetime NULL

Looking at the TSQL the application sends to SQL Server, we can see that it is binding @AccountNumber as an NVARCHAR(10):

exec sp_executesql N'
   select t.Name
   from Sales.Customer c
      inner join Sales.SalesTerritory t
         on c.TerritoryID = t.TerritoryId
   where c.AccountNumber = @AccountNumber'
,
   N'@AccountNumber nvarchar(10)',
   @AccountNumber=N'AW00000010'

So the application is binding an NVARCHAR(10) parameter that is then being compared to a VARCHAR(10) column. SQL Server then converts every row of the VARCHAR(10) column to an NVARCHAR(10) data type. Clearly this is where the performance problem is. The scalability of this query is therefore linearly dependant on the number of rows in the table. The more rows there are, the more values need to be converted. This not a good situation to be in.

Data Type Precedence

Whenever SQL Server is asked to compare two differing data types, it uses its data type precedence rules to determine which data type needs to be converted to perform the comparison. Assuming an implicit conversion exists, the data type that appears lower in the list is converted to the data type that appears higher in the list. SQL Server's Data Type Precedence order has NVARCHAR appearing higher than VARCHAR. Therefore, all VARCHAR data types need to be converted to NVARCHAR in order to evaluate the AccountNumber values.

Unfortunately, in our case, we have ~20,000 VARCHARs to convert to NVARCHAR and 1 NVARCHAR @AccountNumber! This is the reason for the index scan in the query plan. Note that with SQL Server 7.0, the query optimiser would always convert the literal and never the column. SQL Server 2000/5 strictly adheres to the Data Type Precedence rules, even if it means having to convert 19,185 rows. This change in behaviour is documented in KB 271566.

Looking back at the original C# code, we didn't specify a SqlDbType when adding the @AccountNumber parameter. The SqlParameter class will infer the SqlDbType from the object value specified if it is not specified explicitly. A System.String is always interpreted as SqlDbType.NVarChar as all .NET strings are Unicode.

There are a number of possible solutions to this problem:

  • Change the C# code to have knowledge of the SqlDbTypes the parameters are binding to
  • Standardise by only allowing inferred SqlDbType types to be used in your database

Personally, I prefer the first option by a country mile. The second option prevents you from designing a data model that fully exploits all of the advantages of the SQL Server domain. For example, disallowing VARCHARs will double the amount of storage required to store your VARCHARs. I prefer to use a data type dictionary.

Data Type Dictionary

Data dictionaries are normally only used within databases. However, introducing a data type dictionary in the DAL allows you to have fine-grained control over your data types as well as other data type facets such as length and precision. In addition, it provides for better type safety as users of the data dictionary can only pass a .NET string to get an AccountNumber SqlParameter object. Any other .NET would obviously be a compile time error. Lastly, should you ever have to change data types (e.g. accountNumber becomes a System.Int32) it would be easy to make the change to the GetAccountNumber method and get compile time errors where string accountNumbers are still being used.

public static class SqlDictionary
{
   public static SqlParameter GetAccountNumber(string accountNumber)
   {
      SqlParameter parameter = new SqlParameter("@AccountNumber", SqlDbType.VarChar, 10);
      parameter.Value = accountNumber;

      return parameter;
   }
}

With this change, the query plan now looks far more respectable: Updated Query Plan

Relying on inferred Object / Relational data mappings can lead to unexpected issues. This is a classic example of the ORM-impedance mismatch that so many tools try to abstract away. However, by utilising a simple data dictionary, many of these problems can be easily solved by writing disciplined code.

Default Implicit Conversion

SQL Server 2005 has mediated the data type precedence trap somewhat. For example, in SQL Server 2000, comparing a BIT column to a literal 1 always resulted in the BIT column being upcast to a TINYINT. Whilst this may sound a bit bizarre, the problem was that the literal number 1 was always implicitly assigned the TINYINT data type and TINYINT appears above BIT in the data type precedence rules. SQL Server 2005 now applies a bit more intelligence. It assigns the literal the same data type as the column so no conversion is required.

So, if you're still using SQL Server 2000, be sure to specify your BIT literals as 0x1. Whilst this doesn't equate to a BIT data type, its BINARY datatype is lower on the data type precedence rules and is therefore implicitly upcast to a BIT data type. Using this standard you'll never fall into the trap of upcasting all your BITS to TINYINTs. Note that if you run a SQL Server 2005 database in 80 compatability level you'll get the old SQL Server 2000 behaviour.

-- SQL Server 2000 implicit conversion issue.
-- SQL Server 2005 OK.

select ProductID
from Production.Product
where FinishedGoodsFlag = 1;

-- SQL Server 2000 OK.
select ProductID
from Production.Product
where FinishedGoodsFlag = 0x1;

The implicit conversion is quite easy to spot in the query plan.

Convert SQL 200 vs 2005

Labels: ,

Thursday, 1 March 2007

KB Article Hidden Gems

posted by Stefan Delmarco

One habit that has paid off repeatedly has been to loyally scan through PSS's SQL Server 2005 KB Article RSS Feed on a daily basis. There isn't a huge amount of volume; some days as many as 5 new articles, most days none. It works very well as my SQL Server gossip column. The KB articles tell me what the latest issues are and what information is considered critical enough by PSS to warrant a KB article.

For example, yesterday the following KB article was published:

FIX: I/O requests that are generated by the checkpoint process may cause I/O bottlenecks if the I/O subsystem is not fast enough to sustain the IO requests in SQL Server 2005

If you take the time to scroll to the very end of the article, you'll find the More Information section. This is where the really good stuff is kept. This specific hotfix is not what you would expect. The hotfix adds an additional SQL Server command line option to throttle the I/O requests to the disk subsystem. I/O subsystem issues are common when dealing with SQL Server so knowing about this specific issue and the existence of the hotfix will give you some options when investigating I/O subsystem issues.

Labels:

Monday, 20 November 2006

The TABLESAMPLE Clause

posted by Stefan Delmarco

When I first same across TABLESAMPLE I had visions of some advanced index hints that allowed fine grained control over table statistics (distribution, ranges, density, etc.). Unfortunately not. As per Occam's Razor the simplest explanation is the correct one. It just... returns a sample from the specified table...?

The complete syntax for TABLESAMPLE is as follows (from BOL):

TABLESAMPLE

Given this syntax, the most obvious query to test drive TABLESAMPLE would be to return, say, 10 rows from the Sales.SalesOrderHeader table. After all, this syntax looks a lot like TOP:

select SalesOrderId, OrderDate, TotalDue
from Sales.SalesOrderHeader tablesample system (10 rows);

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------

(0 rows(s) affected)

How unexpected is that! Zero rows returned?! Huh? For the sake of adding to the confusion, change the query to return 1,000 rows as follows:

select SalesOrderId, OrderDate, TotalDue
from Sales.SalesOrderHeader tablesample system (1000 rows);

Then run the query a few times and observe how many rows get returned:

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------
44137 2001-09-01 00:00:00.000 3953.9884
...

(938 rows(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------
44407 2001-10-16 00:00:00.000 3729.364
...

(1263 rows(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------
44757 2001-12-01 00:00:00.000 49953.7086
...

(1085 rows(s) affected)

Each execution returns a different number of rows! Some of them just below 1,000 and others just above. Clearly there is a non-deterministic factor at play here. Without any data changing, re-running the identical query keeps giving different results. Digging into the BOL documentation shows that the TABLESAMPLE algorithm has a very strong statistical slant to it. It works as follows:

For the table affected by the TABLESAMPLE, SQL Server needs to determine a probability that a page of that table's data will be sampled. This probability is equal to the PERCENTAGE specified in the TABLESAMPLE. Alternatively, if a number of ROWS was specified, that number is divided by the total number of rows in the table to calculate the probability. In our case, the Sales.SalesOrderHeader table has 31,465 rows. We asked for 1,000 rows in the second query. Therefore the probability is calculated as 3.2%.

Once this probability has been calculated, SQL Server will evaluate it for every page of the table. The details of the algorithm aren't documented but it is along the lines of:

  • For each page, a random number between 0 and 1 is generated (e.g. RAND()).
  • If the random number is greater than the probability, SQL Server will pretend that the page for the table does not exist (i.e. it is not sampled / 'out-of-scope').
  • Otherwise, SQL Server considers all of the rows on the page sampled / 'in scope'.

This is very revealing as it explains the behaviour we saw with the first query. We asked for a sample of 10 rows. SQL Server therefore calculated the probability of a page being returned as:
10 / 31,465 = 0.032 %, a very small number. If we execute the query once, there is a chance SQL Server will not generate a random number that falls within this small probability. However, we can increase our chances of getting at least one page returned by executing the select statement a number of times:

declare @loopCount int;
set @loopCount = 0;

declare @rowcount int;
set @rowCount = 0;

-- Keep looping till we manage to sample at least one page.
while @rowCount = 0 begin
    select SalesOrderId, OrderDate, TotalDue
    from Sales.SalesOrderHeader tablesample system (10 rows);

    set @rowcount = @@rowcount;
    set @loopCount = @loopCount + 1;
end;

raiserror('Required %d loops to return %d rows', 0, 1, @loopCount, @rowCount);

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------

(0 rows(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------

(0 rows(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------

(0 rows(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------

(0 rows(s) affected)

SalesOrderId OrderDate TotalDue
------------ ----------------------- ---------------------
69998 2004-05-06 00:00:00.000 2649.8453
...

(46 rows(s) affected)

Required 5 loops to return 46 rows

Interestingly enough, when we get back some rows (i.e. we get a probability hit) we get 91 rows returned instead of 10. This reveals another facet of the TABLESAMPLE algorithm. It only goes as far as the probability calculation of a page for sampling data. Once it has decided that a page of the table is to be included in the sample, it will not limit the number of rows sampled from that page. Therefore you are always liable to get back the number of rows equivalent to the page size divided by the row width (assuming, of course, that the pages are densely packed). Running the above query repeatedly returns 40-45 rows, indicating that a page of the Sales.SalesOrderHeader table typically holds 40-45 rows per page.

A couple of additional points to note:

  • The TABLESAMPLE is applied before any WHERE clauses are evaluated.
  • This includes any JOIN conditions as well. Some misleading results can be obtained if two tables that are JOIN'd have a TABLESAMPLE applied to them. As the sampling of the tables are independent and uncorrelated, there is no guarantee that related rows from both tables will be included.
  • The TOP clause can be combined with TABLESAMPLE to cap the number of rows returned. We saw earlier how the TABLESAMPLE algorithm can return more rows than requested. However, there is still every chance that less rows will be returned than was specified in the TABLESAMPLE.

The REPEATABLE option of the TABLESAMPLE clause basically causes SQL Server to use the same random numbers when evaluating each of the table's pages again (SQL Server is probably using a linear congruential generator to obtain the random numbers with the same seed). The net effect is that the same pages will be sampled assuming no underlying data has been changed.

In terms of practical applications of the TABLESAMPLE algorithm, I really struggled to think of a situation where it would have been useful. At face value, all that TABLESAMPLE does is return a certain number or percentage of 'random-ish' rows from the specified table. Unless you have some highly-specialised requirement to get a 'flavour' for the data in the table, this new clause isn't that useful. It doesn't augment the TOP / ROWCOUNT clause at all and doesn't expose any customisable or extensible behaviour.

I strongly suspect that the TABLESAMPLE clause was added specifically for the optimiser's gathering of statistics. Anyone who has seen STATMAN scrolling past in a SQL Profiler trace will know that the update of statistics uses a number of undocumented TSQL operators and index hints. For example:

select SalesOrderId, OrderDate, TotalDue
from (
    select top 100 percent SalesOrderId, OrderDate, TotalDue
    from Sales.SalesOrderHeader with (readcommitted, sample 10e-2 percent)
    order by 1, 2, 3)
option (bypass optimizer_queue, maxdop 1);

I count at least 2 undocumented operators and 1 view ordering workaround being used in this query. TABLESAMPLE is definitely a much cleaner implementation for a query that appears to be performing the same function.

Who knows, maybe one day your boss will ask your team: "You know what, it would be really cool if we could return random pages from this table, according to a probability we specify, get back different results every time we execute it, and even return vastly different numbers of rows than what we asked for". You'll be the first one to raise your hand...

Labels: ,

Friday, 13 October 2006

The UNPIVOT Operator

posted by Stefan Delmarco

Completing the discussion on the PIVOT operator is its antonym, the UNPIVOT. However, I would argue that the UNPIVOT operator has been misnamed by Microsoft as it creates the mistaken impression that PIVOT -> UNPIVOT -> PIVOT gets you back to where you started. From BOL:

UNPIVOT performs the opposite operation to PIVOT by rotating columns of a table-valued expression into column values.

It fails to highlight that fact that PIVOT has aggregated the raw data in order to rotate the underlying rows into columns - you have to read to the very end of the Using PIVOT and UNPIVOT sample for the fine print in BOL. In contrast, the UNPIVOT operator does not perform any aggregation whatsoever. It just turns columns into rows. For each column in the table being UNPIVOT'd you get a row in the result set where the row values are ColumnName and ColumnValue. It therefore helps to think of transpose instead of UNPIVOT as this is more accurate description of the UNPIVOT operator's behaviour.

Nevertheless, let's have a look at the syntax to try and get a better feel for what UNPIVOT brings to the table. As a worked example, consider the following requirement: Write a query to produce a result set that contains a list of products (Production.Product) and their SellStartDate and SellEndDate. The SellStartDate and SellEndDate need to be listed in a single column with a second column specifying whether it is the SellStartDate or SellEndDate value being specified. Lastly, the result set needs to be sorted by this consolidated date column to produce a linear timeline of 'product events'.

In SQL Server 2000, you'd probably end up with a query that looks very much like the following:

select ProductId, [Name], 'SellStartDate' as DateType, SellStartDate as DateValue
from Production.Product
union all
select ProductId, [Name], 'SellEndDate' as DateType, SellEndDate as DateValue
from Production.Product
where SellEndDate is not null
order by DateValue;

ProductId Name DateType DateValue
----------- ----------- ----------- -----------
...
320 Chainring Bolts SellStartDate 1998-06-01 00:00:00.000
321 Chainring Nut SellStartDate 1998-06-01 00:00:00.000
322 Chainring SellStartDate 1998-06-01 00:00:00.000
323 Crown Race SellStartDate 1998-06-01 00:00:00.000
324 Chain Stays SellStartDate 1998-06-01 00:00:00.000
325 Decal 1 SellStartDate 1998-06-01 00:00:00.000
326 Decal 2 SellStartDate 1998-06-01 00:00:00.000
327 Down Tube SellStartDate 1998-06-01 00:00:00.000
328 Mountain End Caps SellStartDate 1998-06-01 00:00:00.000
329 Road End Caps SellStartDate 1998-06-01 00:00:00.000
330 Touring End Caps SellStartDate 1998-06-01 00:00:00.000
...

Using the UNPIVOT operator, in SQL Server 2005, the equivalent is as follows:

select ProductId, [Name], DateType, DateValue
from
    (select ProductID, [Name], SellStartDate, SellEndDate
    from Production.Product) as rawData
unpivot
    (DateValue for DateType in
       (SellStartDate, SellEndDate)) as transposed
order by DateValue;

The key to understanding the UNPIVOT syntax is to see how the DateValue and DateType columns are specified. When transposing the columns into rows, they effectively become name / value pairs. The name is the name of the column and value is the value of the column for the relative row. BOL calls it value_column and pivot_column but just think of them as name / value pairs.

The first portion of the query selects the raw data which we will be transposing. The UNPIVOT portion of the query requires the name of the columns to be used for the name and value pair columns (Value FOR Name) followed by the columns that will be transposed. Lastly, the outermost select allows you specify which transposed columns you want to return, so you can choose not to return the Name column if you don't want it.

As the SELECT...CASE statements were able to duplicate all the functionality of the PIVOT operator, SELECT...UNION ALL can perform all the same tricks as UNPIVOT. However, the advantage again is a more compact and expressive syntax.

One behaviour of UNPIVOT to watch out for is that it will not transpose column values that are NULL. NULL values will not appear as name / value pairs in the results. In the sample above I've compensated for this behaviour by including where SellEndDate is not null in the SQL Server 2000 equivalent query. Note that there is way to change this behaviour.

In terms of performance, the query optimiser highlights a significant difference between the SELECT...UNION ALL and UNPIVOT queries. The former requires a pass over the underlying table for each select (2 passes in our sample) whereas the UNPIVOT requires only a single pass. SQL Server is able to transpose the columns into rows for each column in a single pass over the underlying data. This makes the UNPIVOT query less expensive than the equivalent UNION ALL. The size of this penalty is proportional to the number of rows being processed. This fact along makes UNPIVOT a very useful tool to keep note of in your TSQL arsenal.

Labels: ,

Sunday, 1 October 2006

SPID Blocks Itself...?

posted by Stefan Delmarco

Came across this useful KB the other day: 906344, The blocked column in the sysprocesses table is populated for latch waits after you install SQL Server 2000 SP4. The technical detail may be a bit confusing. Essentially the SPID is doing a convoluted "WaitForSingleObject"-type block, waiting for a data page to be fetched into memory from disk. If you see a lot of this type of blocking you may want to check your other performance counters as you may have poor performing IO or buffer pool memory pressure.

Labels:

Saturday, 23 September 2006

Microsoft SQL Server Development Team Blogs

posted by Stefan Delmarco

The best source for SQL Server internals and advanced TSQL are the Microsot SQL Server development teams' blogs. Some are more active than others, but they're all worth subscribing to for that gem you don't want to miss. For example, the SCHEMABINDING tip is a real nugget.

Labels:

Friday, 22 September 2006

For Database Professionals

posted by Stefan Delmarco

Visual Studio 2005 Team Edition for Database Professionals CTP 5 - Beta, download here. An update patch for CTP 5 already available

Labels: ,

The PIVOT Operator

posted by Stefan Delmarco

SQL Server 2005 introduces the PIVOT operator to the TSQL stable. The intention of this operator is to make it easier to transform unique values in rows into columns and aggregate the resulting data. Anyone familiar with Excel pivot tables or Access cross-tab queries will immediately be able to visualise what pivot tables look like. Prior to SQL Server 2005, the only way to transform data, in the manner provided by PIVOT, was to use a number of SELECT...CASE statements. As we'll see, the PIVOT operator has not given us functionality that was not possible before. However, it has made pivoting data much easier to perform and far more readable.

In order to understand how to form a PIVOT query, it is important to be able to describe, in plain English, what the query is trying to accomplish. Personally I have found that thinking of the PIVOT result set in terms of a graph (i.e. a X- and Y- axis) makes it easier to formulate into TSQL.

As a worked example, consider the following requirement. You need to generate a report that will correlate the number of addresses that were changed per city for each year. In other words, formulate a result set that will have each city listed as a column (X-axis). For each city, count the number of times an address in that city was modified in a given year (Y-axis). In SQL Server 2000, you'd probably create some TSQL similar to the following:

select datepart(yyyy, modifiedDate) as [WhenUpdates],
    sum(case city when 'Monroe' then 1 else 0 end) as [Monroe],
    sum(case city when 'Redmond' then 1 else 0 end) as [Redmond],
    sum(case city when 'Snohomish' then 1 else 0 end) as [Snohomish],
    sum(case city when 'Issaquah' then 1 else 0 end) as [Issaquah]
from Person.Address
group by datepart(yyyy, modifiedDate)
order by 1;

WhenUpdates Monroe Redmond Snohomish Issaquah
----------- ----------- ----------- ----------- -----------
1996 1 0 0 0
1997 0 1 0 0
1998 2 2 1 1
1999 11 12 7 13
2000 0 3 2 0
2001 0 5 0 6
2002 1 9 0 13
2003 0 45 0 42
2004 1 44 0 45

A couple of things to note:

  • Even though there are many more cities in the Person.Address table than the ones specified, we must 'hardcode' the city names into the query. There is no way to dynamically create additional columns if a new city is encountered. That fact alone makes this query difficult to maintain. If another city is added to Person.Address, this query would have to be modified.
  • The Y-axis (WhenUpdated) column is dynamic, we do not require prior knowledge of the modifiedDate's content when formulating the query. When an address is modified in a new year (e.g. 2005), another row will automatically be added to this result set without any changes.

Now, for the PIVOT operator. There is good news and bad news. The good news is that the syntax is far more compact and readable. The bad news is that it overcomes none of the limitations mentioned above! However, as we'll see, the good news is still worthwhile...

A PIVOT query is split into a number of sections. First, you need to write the SELECT statement that provides you with the raw, non-pivoted data. In our example, the raw data would be the following query:

select datepart(yyyy, modifiedDate) as [WhenUpdated], AddressID, City
from Person.Address;

WhenUpdated AddressID City
----------- ----------- ------------------------------
...
2004 19 Bothell
1999 20 Bothell
1999 21 Bothell
2001 22 Portland
1999 23 Seattle
2001 24 Duluth
2002 25 Dallas
2001 26 San Francisco
...

WhenUpdated is the Y-axis, AddressID is a data point - we're going to count AddressIDs for a given (X,Y) / (City,WhenUpdated) combination - and City is the X-axis. Once we have the raw data, it is simple case of articulating which column is used for what function in the pivot:

select Pvt.*
from (
    -- The raw data as a derived table...
    select datepart(yyyy, modifiedDate) as [WhenUpdated], AddressID, City
    from Person.Address) as RawData
pivot (
    count(AddressID)  -- How each X,Y point is calculated...
    for City          -- The X-axis, stated explicitly...
    in ([Monroe], [Redmond], [Snohomish], [Issaquah])) as Pvt
order by 1;

WhenUpdated Monroe Redmond Snohomish Issaquah
----------- ----------- ----------- ----------- -----------
1996 1 0 0 0
1997 0 1 0 0
1998 2 2 1 1
1999 11 12 7 13
2000 0 3 2 0
2001 0 5 0 6
2002 1 9 0 13
2003 0 45 0 42
2004 1 44 0 45

At first glance the PIVOT section seems a bit unintuitive. However, if you work through the syntax it will start to make sense. The PIVOT operator needs to know how to calculate each (X,Y) data point's value, specified as an aggregate function. The FOR column contains the X-axis values and the IN contains the hard-coded list of X-axis columns. As you can see, there is no getting around the SELECT...CASE limitation of having to know the transformed columns before-hand. In case you are wondering: no, IN cannot take a wildcard, e.g. IN(*). I believe that the limitation exists due to the optimizer's requirement to have fixed metadata when the execution plan is generated. This would not be possible if the column metadata is dynamically generated from the data itself.

The Y-axis is inferred as the distinct set of all columns that have not been used by the aggregate function or by the X-Axis. In our case, that would be the WhenUpdated computed column.

In terms of execution plans, the SELECT...CASE and PIVOT produce almost identical plans. However, as we're throwing away more of the rows in the Person.Address table (only 267 of the 19,614 rows are for the cities in our X-axis), both queries would be far more efficient if they included the set of X-axis column values (i.e. the names of the cities) in their WHERE clauses.

Instead of trying to dream up dynamic-SQL ways of working around this problem (non-dynamic columns and rows falling outside the predefined X-axis values) I would argue that you're better off aggregating the data in TSQL (using 'traditional' SELECT...GROUP BY statements) and transforming it in your middle tier / client application where there are better data structures that can be used to represent extremely wide or sparse result sets.

A better pattern to follow is to realise that the X-axis can be used to specify fixed ranges instead of raw values. Using this technique you can select ranges that will guarantee that you have covered off all possible X-axis values. For example, instead of allowing the raw unique values of a column to destabilise the PIVOT, the following query maps the values in the TotalDue column in a number of well-known buckets and then uses those values as the X-axis. Using this technique there is no possibility of having rows fall outside the X-axis range:

select Pvt.*
from (
    select datepart(yyyy, OrderDate) as [OrderDate],
        case
            when TotalDue < 100 then 'NoMargin'
            when TotalDue >= 100 and TotalDue > 1000 then 'OnTarget'
            else 'ExtraBonus' end as [Profit],
        PurchaseOrderId
    from Purchasing.PurchaseOrderHeader) as RawData
pivot (
    count(PurchaseOrderId)
    for Profit
    in([NoMargin], [OnTarget], [ExtraBonus])) as Pvt
order by 1;

OrderDate NoMargin OnTarget ExtraBonus
----------- ----------- ----------- -----------
2001 0 4 4
2002 7 105 160
2003 26 399 607
2004 68 1070 1550

As we've seen, the PIVOT operator does not provide any transformational capabilities that weren't possible in SQL Server 2000 (with liberal use of SELECT...CASE). However, as any experienced developer will know, the easier it is to express a reasonably complex SET-based problem, the better chance you have of getting it right, first time. Even more importantly, the poor developer required to maintain it (after you've moved on to greater conquests) will have a better chance of getting it right for the second time!

Labels: ,

Saturday, 26 August 2006

The APPLY Operator

posted by Stefan Delmarco

We've come across the APPLY operator before when we were exploring DMVs. The APPLY operator has been added to the TSQL repertoire in SQL Server 2005. It fills an important gap with User-Defined Functions (UDFs).

UDFs are less well-known than stored procedures (SPs) and are often overlooked by inexperienced SQL developers. BOL has a good write-up on how to choose between SPs and UDFs. For the context of this article, a brief overview of UDFs:

UDFs come in two flavours: those that returns scalars and those that return tables. Scalar-valued UDFs are very straightforward. They take in a bunch of parameters and return a single scalar value. These are functions in the traditional VB / C# / C++ sense. Table-valued functions are a little different. They also can take in a bunch of parameters. However, they are able to return a table instead of a scalar value. This returned table is referenced in TSQL in the same manner as a table or a view. This makes them look like a cross between functions and views. In fact, there are two types of table-valued UDFs. Those that contain a single SELECT statement (they look very much like a derived table) called 'Inline Table-Valued Functions' and those that build-up the return table using procedural logic (they look like stored procedures) called 'Multistatement Table-Values Functions'. It is the latter-type that we are interested in.

Multistatement table-valued functions are very powerful as they allow us to encapsulate complex procedural logic in a function when that logic cannot be implemented in a database view (i.e. you cannot express that desired output using a single set-based TSQL expression). You are free to construct the output rows in what ever manner you wish! The only restriction is that you need to declare the schema of the table that the UDF returns upfront as part of the UDF's declaration.

As an example, consider the following table-valued UDF:

create function dbo.udf_DecomposeDns (
    @webServiceUrl nvarchar(1024))
returns @dnsDomains table (
    domain nvarchar(1024))
as begin
    -- Return no rows if NULL...
    if @webServiceUrl is NULL
        return;

    declare @dns nvarchar(1024);

    -- Remove the contents after the DNS...
    declare @slashIndex int;
    set @slashIndex = charindex(N'/', @webServiceUrl);
    if @slashIndex > 0
        set @dns = left(@webServiceUrl, @slashIndex - 1);
    else
        set @dns = @webServiceUrl;

    -- Decompose the DNS into its hierarchical parts...
    declare @dotIndex int;
    set @dotIndex = charindex(N'.', @dns);
    insert @dnsDomains (domain) values (@dns);

    while @dotIndex > 0 begin
        set @dns = right(@dns, len(@dns) - @dotIndex);
        insert @dnsDomains (domain) values (@dns);
        set @dotIndex = charindex(N'.', @dns);
    end;
    return;
end;

This UDF will decompose a URL into its hierarchical DNS domain constituents. For the sake of brevity the URL is assumed to consist of a DNS domain name followed by a URL-path (as in AdventureWorks' Purchasing.Vendor PurchasingWebServiceURL column). For example:

select *
from dbo.udf_DecomposeDns(N'www.extranet.fotia.com/order.asmx')

domain
-------------------------
www.extranet.fotia.co.uk
extranet.fotia.co.uk
fotia.co.uk
co.uk
uk

So far, all this is well within the capabilities of SQL Server 2000. Now, consider the following requirement: Generate a report that will contain the DNS composition of each Vendor's PurchasingWebServiceURL in AdventureWorks' Purchasing.Vendor table.

To achieve this we want to decompose each URL into its domain hierarchy. For each vendor we would expect multiple rows to be returned, equivalent to the number of parts in that Vendor's URL. This would not be possible to do in a single SELECT statement in SQL Server 2000. The SELECT syntax did not have an operator / syntax that would allow a table-valued UDF to be re-evaluated repeatedly for every row it was being joined to. We could only execute a table-valued UDF once in a SELECT statement. We could pass a parameter to the UDF but the parameter value could not be from any column in the resulting set. It could only be a constant value or a declared variable.

This is the gap that the APPLY operator fills in SQL Server 2005. It allows us to get the table-valued UDF to be executed for each row it is being joined to. In this manner the parameters to a table-valued UDF should be seen as the join criteria.

The APPLY operator can be used in two forms: CROSS APPLY and OUTER APPLY. To understand the difference think of the table-valued UDF as a normal table you join to. You'd either use an INNER or an OUTER join. CROSS APPLY is an INNER JOIN, and OUTER APPLY is an OUTER JOIN. If the UDF returns an empty result set for a specific row, that row will be excluded if CROSS APPLY is used. However, if OUTER APPLY is used that row will still be included. NULL values will be substituted for the missing UDF row (identical to a traditional OUTER JOIN). It's that simple.

CROSS APPLY over the Purchasing.Vendor table produces the following result:

select [Name], PurchasingWebServiceURL, domain
from Purchasing.Vendor
    cross apply dbo.udf_DecomposeDns(PurchasingWebServiceURL);

Name PurchasingWebServiceURL domain
------------------- ------------------------------ --------------------------
A. Datum Corporation www.adatum.com/ www.adatum.com
A. Datum Corporation www.adatum.com/ adatum.com
A. Datum Corporation www.adatum.com/ com
Trey Research www.treyresearch.net/ www.treyresearch.net
Trey Research www.treyresearch.net/ treyresearch.net
Trey Research www.treyresearch.net/ net
Wide World Importers www.wideworldimporters.com/ www.wideworldimporters.com
...

Compare the output with OUTER APPLY. Note the NULL rows substituted for the missing rows from the UDF:

select [Name], PurchasingWebServiceURL, domain
from Purchasing.Vendor
    outer apply dbo.udf_DecomposeDns(PurchasingWebServiceURL);

Name PurchasingWebServiceURL domain
---------------------------------- ------------------------ --------------------
International NULL NULL
Electronic Bike Repair & Supplies NULL NULL
Premier Sport, Inc. NULL NULL
Comfort Road Bicycles NULL NULL
Metro Sport Equipment NULL NULL
Green Lake Bike Company NULL NULL
Mountain Works NULL NULL
Continental Pro Cycles NULL NULL
A. Datum Corporation www.adatum.com/ www.adatum.com
A. Datum Corporation www.adatum.com/ adatum.com
A. Datum Corporation www.adatum.com/ com
Trey Research www.treyresearch.net/ www.treyresearch.net
Trey Research www.treyresearch.net/ treyresearch.net
Trey Research www.treyresearch.net/ net
Anderson's Custom Bikes NULL NULL
Compete, Inc. NULL NULL
...

We are able to bring the full might of TSQL to bear on table-valued UDFs. For example, aggregations can be run on the result set:

select domain, count(*) as [CountOf]
from Purchasing.Vendor
    outer apply dbo.udf_DecomposeDns(PurchasingWebServiceURL)
group by domain
order by 2 desc;

domain CountOf
----------------------------- -----------
NULL 98
com 5
litwareinc.com 1
net 1
northwindtraders.com 1
proseware.com 1
treyresearch.net 1
...

The APPLY operator is indeed a valuable addition to the TSQL. It allows a whole new modular and procedural approach to composing result sets.

Labels: ,

Monday, 7 August 2006

Introducing Common Table Expressions (CTEs)

posted by Stefan Delmarco

The concept of CTEs has been around in SQL Server for a long time. Anyone who's used derived tables before will immediately recognise and understand CTEs. Think of CTEs as formalised sub-queries or temporary views that can be defined and referenced in a single TSQL statement.

I've labelled this article as an introduction as CTEs bring a lot to the TSQL table, which we'll explore in much greater detail in future articles.

Consider the following requirement: you need a result-set that returns the sum of all products sold per product. For each product, you'll need to return a bunch of metadata about that product. In addition, your boss (for reasons only known to him) want these results only for products that are either red in colour or have more than 10 transactions! In SQL Server 2000 you might start writing something similar to the following:

select [Name], Color, sum(Quantity) as [SumQuantity]
from Production.Product P
    inner join Production.TransactionHistory T
        on P.ProductId = T.ProductId
where P.Color = 'red'
    or (    select count(*)
            from Production.TransactionHistory THinner
            where THinner.ProductId = P.ProductID) >= 10
group by [Name], Color;

Whilst this query gets the job done, it does look a bit messy as we're having to perform a correlated sub-query for each ProductID, evaluate it as a scalar and then determine whether or not we're going to calculate the sum of all quantities of all transactions for the associated ProductID. In addition, this TSQL is misrepresenting its purpose in the GROUP BY clause. The requirement stated that the summation needs to be performed per ProductID. We probably got lucky here as Name is unique per Product. If it wasn't we could get some unpredictable results (e.g. if there were two Pink FooBar's that differed only by some other metadata). Any column in the SELECT list that is not an aggregate must be included in the GROUP BY. It is the mix of aggregate and non-aggregate data that has lead to this mistake. A correct way to write this query would be to use a derived table:

select P.[Name], P.Color, PA.SumQuantity
from Production.Product P
    inner join (
            select ProductID,
                count(*) as [CountOfTransactions],
                sum(Quantity) as [SumQuantity]
            from Production.TransactionHistory
            group by ProductID) as PA
        on P.ProductId = PA.ProductID
where PA.[CountOfTransactions] >= 10
 or P.Color = 'red';

This is a much better as we have separated out the aggregate from the additional metadata portion of the query. The inner select statement is a derived table that is uncorrelated with the outer select so that it can execute independently. We have given this derived table the PA alias. Whilst this has given the TSQL much better structure, it does have some limitations. If we wanted to refer to the derived table again we would have to copy-paste it again into the select. For example, if we now wanted to include number of transactions the modulo 10 of the ProductID contained (simulating an arbitrary Product-to-Product relationship), we'd have to copy-and-paste the whole derived table again:

select P.[Name], P.Color, PA1.SumQuantity, PA2.ProductID, PA2.SumQuantity
from Production.Product P
    inner join (
            select ProductID,
                count(*) as [CountOfTransactions],
                sum(Quantity) as [SumQuantity]
            from Production.TransactionHistory
            group by ProductID) as PA1
        on P.ProductId = PA1.ProductID
    inner join (
            select ProductID,
                count(*) as [CountOfTransactions],
                sum(Quantity) as [SumQuantity]
            from Production.TransactionHistory
            group by ProductID) as PA2
        on P.ProductId % 10 = PA2.ProductId
where PA1.[CountOfTransactions] >= 10
 or P.Color = 'red';

Now to introduce CTEs: Rewriting the above query to use a CTE will look as follows,

with ProductAggregate(ProductId, CountOfTransactions, SumQuantity)
as (
    select ProductID, count(*), sum(Quantity)
    from Production.TransactionHistory
    group by ProductID)
select P.[Name], P.Color, PA1.SumQuantity, PA2.ProductID, PA2.SumQuantity
from Production.Product P
    inner join ProductAggregate PA1
        on P.ProductId = PA1.ProductID
    inner join ProductAggregate PA2
        on P.ProductId % 10 = PA2.ProductId
where PA1.CountOfTransactions >= 10
or P.Color = 'red';

As a developer, think of a CTE as a local variable declaration, having scope of a single TSQL statement. The WITH statement is the declaration that defines the columns (schema) of the CTE. The AS portion defines the implementation of the CTE. Once declared, the CTE exists as any normal table would in the TSQL statement. As demonstrated above, the CTE can be reused as many times are required in the TSQL statement. In the example above we're using it twice (with different aliases) yet we don't have to re-declare it (as we were required to with derived tables). CTE re-use extends to UNIONs as well. For example:

with ProductAggregate(ProductId, CountOfTransactions, SumQuantity)
as (
    select ProductID, count(*), sum(Quantity)
    from Production.TransactionHistory
    group by ProductID)

select P.[Name], P.Color, PA.SumQuantity
from Production.Product P
    inner join ProductAggregate PA
        on P.ProductId = PA.ProductID
where PA.CountOfTransactions >= 10

union all

select P.[Name], P.Color, PA.SumQuantity
from Production.Product P
    inner join ProductAggregate PA
        on P.ProductId = PA.ProductID
where P.Color = 'red';

A couple of other points to note about CTEs:

  • A CTE can only be used in the TSQL statement it is declared in. If you find yourself repeating a CTE a number of times then you should consider 'upgrading' it to a database view in its own right (assuming you're not using a CTE-only feature - mentioned below).
  • CTE's can be used in SELECT / UPDATE / DELETE / INSERT statements, anywhere a table can be used.
  • The previous statement of any TSQL statement that uses a CTE must be terminated with a semi-colon. This can be a bit infuriating if you're not used to using semi-colons in your TSQL but try to get into the habit of terminating all your TSQL statements with semi-colons (this works even in SQL Server 2000).

Anyone that's looked at CTEs before will have noted that I have not yet mentioned the one feature that everyone immediately highlights when discussing CTE's, i.e. CTE's can be used recursively to join on themselves. In my opinion, the recursive properties of CTEs will only be used in the very few queries that need to unroll / traverse hierarchies. The ability to write modular set-based queries is the primary reason we should all start using CTEs. They make queries easier to maintain and allow complex queries to be broken down into their logical components without sacrificing the power of set-based queries (as opposed to poorly performing but easier to read procedural TSQL).

Labels: ,

Monday, 24 July 2006

VARCHAR(MAX)

posted by Stefan Delmarco

VARCHAR(MAX), NVARCHAR(MAX) and VARBINARY(MAX) are the SQL Server 2005 replacements for TEXT, NTEXT and IMAGE large. Not only are the replacements far better named than their SQL Server 2000 counterparts, they are also far easier to use. However, before we start rejoicing in their virtues, a word of caution. Just because you can now assign a 2GB string to a local variable doesn't mean you should! The ease with which large-value data types can be accessed and manipulated means that we have a greater need to treat the data responsibly and be aware that we could be handling a large amount of data in a local variable. It is now more important to be aware that you could be moving mountains of data in some very innocent looking TSQL. We must make sure we don't use these powerful capabilities inappropriately. "With great power comes great responsibility" - Spiderman!

In BOL you'll notice that a distinction is drawn between large value types and large object data types. The new MAX types are part of the large value types and the TEXT / NTEXT / IMAGE family are part of the large object types. The only distinction I can find between the two that warrants this separation is the use of the infamous text pointer that was used with large object types. So when you come across large value versus large object types in BOL, this is what they mean.

TEXT, NTEXT and IMAGE are notoriously difficult to use within the database (i.e. within TSQL or stored procedures). It is not possible to use TEXT, NTEXT and IMAGE data types as local variables. SQL Server allows us to use a level of indirection through a 'text pointer'. Essentially this is a handle to a TEXT / NTEXT / IMAGE column of a specific row. In order to access the data in the column we first have to:

  • read the text pointer of the column of the row we were interested in (using the TEXTPTR function), then
  • use the text pointer as a parameter to the READTEXT, WRITETEXT or UPDATETEXT functions.

In addition, we are responsible for controlling the locking / isolation of the text pointer from changes while we are reading it through the use of HOLDLOCK locking hints and within transactions. This prevents anyone from changing the data while we are 'dereferencing' the text pointer. Alternatively you can just SELECT the data into VARCHAR / NVARCHAR / VARBINARY local variables and risk truncating the data. You can also SELECT the column INTO another table that has TEXT / NTEXT / IMAGE columns.

Whilst this sounds like many hoops to jump through, it is actually beneficial as developers resort to handling TEXT / NTEXT / IMAGE data outside the database. Managing these data types from ADO.NET / OLE DB is far easier than trying to manipulate the data within the database. The data can be extracted in plain SELECT statements (instead of READTEXT) and be streamed back to the client in chunks. In addition, this is a far more scalable alternative as it doesn't consume precious memory resources on the SQL Server.

New to SQL Server are the 'MAX' versions of VARCHAR / NVARCHAR / VARBINARY. There are the replacements for TEXT / NTEXT / IMAGE that have now been marked as deprecated and will be removed in a future version of SQL Server. These 'MAX' versions are meant to address the in-database manipulation difficulties of their predecessors. They are also designed to be able to handle small to very large data sizes efficiently without having to choose between VARCHAR / NVARCHAR / VARBINARY and TEXT / NTEXT / IMAGE as before.

All of the following capabilities apply equally to VARCHAR / NVARCHAR / VARBINARY. For brevity I'll just refer to VARCHAR.

You can now declare local variables of type VARCHAR(MAX) so that you can construct strings greater than 8,000 bytes (8,000 VARCHAR characters or 4,000 NVARCHAR / Unicode characters) in size and select VARCHAR(MAX) column values directly into your local variables. This now makes cursors using large value types usable in the database (again, just because you can doesn't mean you should…).

The concept of a text pointer no longer exists. Goodbye TEXTPTR, TEXTVALID, READTEXT, UPDATETEXT and WRITETEXT!

All of the string functions have been 'upgraded' so that they can be used with VARCHAR(MAX). Previous only PATINDEX, DATALENGTH and SUBSTRING could be used with a TEXT column. Now all of the family favourites like LEN, LEFT, RIGHT, etc. can now be used on large-value data types. However, note that there are implicit overloads to some of these string functions. As an example, execute the following TSQL and note the output:

declare @smallVarChar varchar(1)
set @smallVarChar = '*'

select datalength(replicate(@smallVarChar, 8001)) as LengthOfSmallVarChar
go

LengthOfSmallVarChar
--------------------
8000

declare @maxVarChar varchar(max)
set @maxVarChar = '*'

select datalength(replicate(@maxVarChar, 8001)) as LengthOfMaxVarChar
go

LengthOfMaxVarChar
--------------------
8001

The first TSQL batch calls the REPLICATE function with a small (non-MAX) VARCHAR. In return, REPLICATE returns a non-MAX VARCHAR that is truncated at 8,000 characters. However, if we pass in a VARCHAR(MAX) to REPLICATE we get a VARCHAR(MAX) back out and no truncation. So watch out for this. The transition from small to MAX VARCHARs is not implicit. If your requirement is to accommodate large VARCHARs then this needs to be baked into your code from the outset.

VARCHAR(MAX) and its counterpart TEXT can interact with each other. For example, the following TSQL is perfectly legal:

-- Create a table with a TEXT column...
create table bigTable (
    col1 text not null)
go

-- Add some data...
declare @col1 varchar(max)
set @col1 = replicate(cast('*' as varchar(max)), 10000)

insert bigTable (col1)
values (@col1)
go

-- Read out the TEXT into a VARCHAR(MAX)
declare @col1 varchar(max)
select @col1 = col1
from bigTable

select len(@col1)
go

This is very useful as it means you don't have to redefine all of your TEXT columns in your database when you migrate from SQL Server 2000 to 2005. If you need to manipulate TEXT data in the database just SELECT it into a VARCHAR(MAX) variable.

Lastly, a couple of points to keep in mind with VARCHAR(MAX):

  • The size range for VARCHAR is now 1 to 8,000 or MAX. Note there is nothing between 8,000 and MAX.
  • VARCHAR(MAX) is more than a replacement from TEXT. It works just as well with short strings as well as long ones. Therefore, if there is any possibility that you'll exceed 8,000 bytes then use VARCHAR(MAX).
  • The maximum size for a VARCHAR(MAX) is 2^31 - 1 bytes / characters.
  • The [text in row] option for a table remains specific to TEXT / NTEXT / IMAGE data types. This option lets you tell SQL Server at which size to move TEXT / NTEXT / IMAGE data out of the table's data pages and into its own page allocation (assuming that the [text in row] option is ON). The new [large value types out of row] setting replaces the TEXT option for VARCHAR(MAX). However, this option is now binary. It is either ON or OFF. The size at which MAX types are moved out of a table's data pages and into its own data pages is now fixed at 8,000 bytes if [large value types out of row] is set to OFF.
  • The UPDATE statement has been enhanced to provide a replacement for UPDATETEXT. To update a portion of a MAX data type use the .WRITE suffix. For example:
-- Create a table with a VARCHAR(MAX) column...
create table bigTable (
    col1 varchar(max) not null)
go

-- Add some data...
declare @col1 varchar(max)
set @col1 = replicate(cast('1' as varchar(max)), 10000)

insert bigTable (col1)
values (@col1)
go

-- Update a portion of the VARCHAR(MAX)...
-- Update the 4th, 5th and 6th characters, truncate the rest
-- of col1.
update bigTable
set col1.write('222', 3, null)

select *
from bigTable
go

col1
-------
111222

Next up are CTEs (Common Table Expressions).

Labels: ,

Thursday, 6 July 2006

The OUTPUT Clause

posted by Stefan Delmarco

If any new feature of TSQL will be a big hit it will be the OUTPUT clause. This is one new addition to TSQL that you'll use everywhere! How many times have you written this code?

-- A typical table...
create table someTable (
    Id int identity(1,1) primary key clustered,
    SomeData varchar(100) not null);
go

declare @id int;
-- Insert the data.
insert someTable (someData)
values ('DataOne');

-- Figure out what identity the row was given.
set @id = scope_identity();
-- Return it...
select @id as Id;
go

Isn't it tedious that you have to read the data that you just inserted as a second operation following the insert? The whole reason scope_identity exists is to protect you from any context that might be introduced between these two operations that you may not be aware of, like triggers (aarrgh!). After all, why can't you just write the following...

declare @id int;

-- Insert the data, better...
insert someTable (someData)
output inserted.Id
values ('DataTwo');
go

This is exactly how the OUTPUT clause works. It allows data that is inserted / updated / deleted to be returned. Anyone that's done any work with triggers would have immediately recognised the 'virtual' inserted table. In fact, drawing parallels between triggers and the OUTPUT clause is the best way to understand what's going on here. The OUTPUT clause allows you to access the virtual trigger tables inline with your INSERT / UPDATE / DELETE statements. You essentially get 'on-the-fly' trigger access. For INSERT you have the inserted table, for update you have both the inserted and deleted tables, and for DELETE just the deleted table. Simple, isn't it? In the above example we use the OUTPUT clause to get to the identity column we inserted. It could have been any column in the table. Think how useful this would be for retrieving timestamp values or computed columns.

Have a look at these examples to see what some of the practical applications are:

(1) What did I just delete?

delete someTable
output deleted.*
where id = 1;
go

This query returns all of the data that was just deleted. Great for audit records! It also prevents having to take out locks on the rows you're going to delete so that the values read are the ones deleted. No encompassing transaction required here.

(2) Before and after values

update someTable
set someData = 'DataThree'
output deleted.someData as Before,
    inserted.someData as After,
    difference(deleted.someData, inserted.someData) as Delta
where id = 2;
go
Before After Delta
--------- ---------- ------
DataTwo DataThree 3

This query will return the values for someData as it was before the update and it is after the update in a single row. In addition, note that the OUTPUT clause is defined as a meaning that any expressions (i.e. functions / calculations, etc.) you could do in a SELECT you can do in an OUTPUT.

(3) Deleting multiple rows

-- Declare a table variable.
declare @someTableTemp table (
    Id int not null,
    someData varchar(100) not null);

-- Capture the output into a table variable.
delete someTable
output deleted.id, deleted.someData
into @someTableTemp
    (Id, someData);

-- What was deleted...
select *
from @someTableTemp;
go

The output can also be captured into a table variable. It doesn't have to be a table variable. It could be any permanent table. In this case think of the OUTPUT ... INTO as an INSERT ... SELECT. Just for the hell of it you can write crazy TSQL like this:

-- The delete that never was!
delete someTable
output deleted.someData
into someTable (someData);
go

Lastly, a couple of points to remember:

  • The values in the output virtual tables will be populated after any triggers for the table have fired, so remember that you are at the back of the queue.
  • Inserting / updating / deleting data and then re-reading the rows affected is always dangerous because it makes you susceptible to deadlocks. Using the OUTPUT clause will avoid this as the action and read occur as a single unit. In SQL Server 2000 there was a trick with the UPDATE statement that allowed you to avoid re-reading updates if you're interested in the value of a specific column that was updated (this only works if the UPDATE affects a single row). Consider the following table:

create table ReserveRange (
    IdName varchar(100) not null primary key clustered,
    Id int not null);
go

insert ReserveRange (IdName, Id)
values ('Id#1', 0);
go

You want to use this table to reserve ranges of Id's. You need to:

  • read the current Id,
  • increment it by a certain amount,
  • return the range that has been reserved in a thread-safe manner.

Normally developers wrap the TSQL in a transaction and use a UPDLOCK locking hint to serialize access to the table like this (a couple of variations of this are possible but all require the transaction and a long-running update lock):

begin tran;

declare @range int;
set @range = 100;
declare @id int;

-- Get the current value.
select @id = Id;
from ReserveRange with(updlock)
where IdName = 'Id#1';

-- Reserve a range.
update ReserveRange
set id = @id + @range
where IdName = 'Id#1';

-- Return the range
select @id + 1 as LowRange,
    @id + @range as HighRange;

commit tran;
go

The trick is to notice that the UPDATE statement supported triple assignments. Have a look at the following:

declare @range int;
set @range = 100;

declare @id int;

-- Update the current range and assign it to
-- a local variable
update ReserveRange
set @id = id = id + @range
where IdName = 'Id#1';

-- Return the range
select @id - @range + 1 as LowRange,
    @id as HighRange;

Note, no encompassing transaction and no long-running locks. Next we'll look at VARCHAR(MAX).

Labels: ,

Thursday, 29 June 2006

Dynamic Management Views Part II

posted by Stefan Delmarco

In this instalment we'll be looking at a single DMV. This DMV provides such a wealth of information that it will take quite a few passes to cover it in depth. If anything, this DMV will make detecting and troubleshooting poorly performing queries something that can be easily done rather than requiring specialised, and potentially performance damaging tools like Profiler. The problem with Profiler is that it is great for debugging and dissecting SQL Server requests in a controlled environment (such as your development or test environment). It isn't good at running perpetually in a production environment, logging every single query that is executed, with its query plan and runtime statistics. Firstly, the amount of data it would generate for even a modest enterprise-level application would be unmanageable. Secondly, depending on what events are being captured and their frequency, the performance of SQL Server could be adversely affected. I'm not saying that SQL Profiler shouldn't be used in a production environment to assist with investigating and diagnosing a specific issue. Rather, it can't be used by the DBA's to monitor the overall performance of SQL Server, so that they can look for potentially troublesome queries, and be proactive in the detection of performance issues. This task normally falls on the middle-tier / application code to log execution times or update custom performance counters. Even then, it is normally support engineers that look at these logs rather than the DBAs.

In SQL Server 2005 we now have the sys.dm_exec_query_stats DMV. This view contains a row for each query plan in SQL Server's query plan cache. SQL Server is built on the principle that you prepare a query plan once and execute it many times. Caching of the query plan is central to the architecture of SQL Server. Exposing this core part of the internal architecture really means that there are no more secrets!

So, what can we see in this DMV? Firstly, we can obtain a handle to the query plan's TSQL text. In SQL Server 2000 we had to use DBCC INPUTBUFFER or the ::fn_get_sql function (which has now been marked as to-be-deprecated) to be able to see what TSQL a session was executing.

Even then DBCC INPUTBUFFER just showed the last statement executed and ::fn_get_sql only returned the TSQL as it was executing. Neither provided any kind of insight into the actual content of the query plan cache. The only way I knew of was to take a crash dump of the SQL Server process and dissect the dump using Microsoft PSS tools that understood SQL Server's memory layout.

To have a look at every TSQL statement in the query plan cache, execute the following query over the sys.dm_exec_query_stats DMV:

select db_name(sql_text.dbid) as [Database],
    case db_id()
        when sql_text.dbid then object_name(sql_text.objectid)
        else cast(sql_text.objectid as sysname) end as [Object],
    sql_text.text as [TSQL]
from sys.dm_exec_query_stats
    cross apply sys.dm_exec_sql_text(sql_handle) as sql_text;

If you don't see any rows returned, execute this query twice and you'll see the query above appear in the cache! If you see too many rows use DBCC FREEPROCCACHE to empty out the query plan cache.

There are a couple of things in this query you might not have seen before. For example, the APPLY operator is new. We'll look at it in detail in another article but for now just think of it as an inner join from a query to a table-valued function. The table-valued function will be invoked for each row in the query. If the table valued function returns multiple rows a query row, the query row will be returned multiple times (as you'd expect from an inner join).

The CROSS APPLY is taking each sql_handle in the sys.dm_exec_query_stats DMV and uses the sys.dm_exec_sql_text function to return the database, object id and text of the query plan. The case logic is there just to turn the objectid into its object name if the current database is the same as object's database.

OK, so this is all very interesting, but how does this help me see what the performance of these queries is like? Well, for each query plan, SQL Server will maintain an aggregate of:

  • The number of times query plan has been recompiled (find all your SPs containing temp tables here!)
  • A handle to the query plan (we'll look at this in depth later).
  • The last time the query was executed.
  • The number of times it has been executed (how easy is it to spot the most used queries now?!).
  • The total amount of the following resources consumed by all invocations of the query plan as well as the least and greatest amount of CPU consumed by a single invocation of the query plan:
    • CPU
    • Physical Reads
    • Logical Writes
    • Logical Reads
    • CLR Time (if you use CLR in SQL you'd be able to easily spot a performance problem with it!)
    • Elapsed Time

Wow! Christmas definitely came early this year! This mind boggles at the variety of queries that could be constructed to slice and dice this aggregated data. When digging into the TSQL note that the query plan may only be for a fragment of the TSQL text. You need to use the statement_start_offset and statement_end_offset values to find the chunk. Use this helper function to extract the TSQL fragment:

create function dbo.GetTSQLFragment (
    @tsql nvarchar(max),
    @start_offset int,
    @end_offset int) returns nvarchar(max)
with execute as caller
as begin
    declare @len int;
    set @len = (case
        when @end_offset = -1 then len(@tsql) * 2
        else @end_offset end - @start_offset) / 2;

    declare @TSQLFragment nvarchar(max);
    set @TSQLFragment = substring(@tsql, @start_offset / 2 + 1, @len);
    return @TSQLFragment;
end;

The query now becomes:

select db_name(sql_text.dbid) as [Database],
    case db_id()
        when sql_text.dbid then object_name(sql_text.objectid)
        else cast(sql_text.objectid as sysname) end as [Object],
    sql_text.text as [AllTSQL],
    dbo.GetTSQLFragment(
        sql_text.text,
        statement_start_offset,
        statement_end_offset) as [FragmentTSQL]
from sys.dm_exec_query_stats stats
    cross apply sys.dm_exec_sql_text(sql_handle) as sql_text;

To see the fragments in action, run the following query and then have another look at the query plan cache using the above query:

select * from Production.Culture;
select * from Production.Product;
select * from Production.Location;

You'll see the above query repeated three times with each select statement in its own TSQL fragment and will therefore have its own query plan. However, note the presence of the plan_handle column. This can be used to group query plans together as they were submitted as part of the same batch. Include plan_handle in the SELECT from sys.dm_exec_query_stats, execute variations of the triple-select batch statement above (just reordering is sufficient) and observe how the query plans for the same SELECT statements multiply. For example, run the following batches:

select * from Production.Location;
select * from Production.Culture;
select * from Production.Product;
go

select * from Production.Culture;
select * from Production.Location;
select * from Production.Product;
go

select * from Production.Location;
select * from Production.Product;
select * from Production.Culture;
go

It's not hard too see why ad-hoc TSQL is bad when you see what it does to your query plan cache. For stored procedures, no matter how large or complex the stored procedure is, there will only ever be one query plan for it. However, unfortunately there is some fine print. Query plans that use different SET (environment) options cannot be shared. More on this later.

Execute the following batch and have a look at the query plan cache. There is only one query plan for the stored procedure, irrespective of the ordering and parameters:

exec dbo.uspGetBillOfMaterials 800, '2003/01/01';
exec dbo.uspGetBillOfMaterials 801, '2005/01/01';
exec dbo.uspGetBillOfMaterials 800, '2004/02/08';
exec dbo.uspGetBillOfMaterials 800, '2001/02/08';
go

exec dbo.uspGetBillOfMaterials 799, '2004/02/08';
exec dbo.uspGetBillOfMaterials 800, '2003/01/01';
go

exec dbo.uspGetBillOfMaterials 780, '2004/02/08';
exec dbo.uspGetBillOfMaterials 721, '2001/02/08';
go

Now for the fine print. There are a number of factors that affect the optimiser's ability to re-use query plans. It is important to be aware of the factors as query plan re-use == performance (compiling query plans can be expensive). One of the most prominent factors is the SET (environment) options of the connection executing the query. For a query plan to be re-used, the environment options of the cached query plan needs to be the same as the query being executed. Query optimisation and query plan re-use is an enormous topic that we'll cover over the course of the 'Discover SQL Server 2005' articles. However, for now, note that executing the following query results in two different query plans being cached for the same stored procedure. Execute the query then have a look in the cache using the above DMV query:

set ansi_warnings on
exec
dbo.uspGetBillOfMaterials 721, '2001/02/08';
go

set ansi_warnings off
exec
dbo.uspGetBillOfMaterials 721, '2001/02/08';
go

More often than not environment options are a non-issue as all connections to an application's database will be instantiated by some common data access layer code, so all application connections inherently use the same environment options.

Lastly, to illustrate the power of this DMV, look at this. Want to know which are the 50 most CPU intensive queries in your database? This is all that you need:

select top 50 db_name(dbid) as [Database],
    case db_id()
        when sql_text.dbid then object_name(sql_text.objectid)
        else cast(sql_text.objectid as sysname) end as [Object],
    sql_text.text as [AllTSQL],
    dbo.GetTSQLFragment(
        sql_text.text,
        statement_start_offset,
        statement_end_offset) as [FragmentTSQL],
    plan_handle,
    total_worker_time
from sys.dm_exec_query_stats
    cross apply sys.dm_exec_sql_text(sql_handle) as sql_text
order by total_worker_time desc;

We'll be visiting many more DMVs and sample queries off the DMVs in future articles.

Labels: ,

Tuesday, 13 June 2006

Dynamic Management Views (DMVs) Part I

posted by Stefan Delmarco

In SQL Server 2000 we had to dig into the system tables to take a peek under the covers. Often many of these tables were undocumented and we had to rely on reverse engineering, sp_helptext, and the debugger to figure out what was going on. In addition, Microsoft was at pains to tell us that these tables are undocumented and as such they should be treated as 'internal-only'. In SQL Server 2005, Microsoft has really listened to their customers. We now have a set of virtual tables known as Dynamic Management Views (DMVs).

A lot needs to be read into the name. Firstly, they've called them dynamic not only because they change dynamically with the internal state of SQL Server, but also because Microsoft reserves the right to change the DMV's schema in future versions (leopards and spots comes to mind...). Management because they're used to inspect the state of SQL Server and diagnose issues. Views because they're not actual tables (actually, some of them are even functions). They're just a read-only table-like representation of SQL Server in-memory structures (a bit like sysprocesses in SQL Server 2000). The difference between the old system tables and DMVs is that DMVs are documented and fully supported by Microsoft. They reside in the sys schema (so all the DMVs must be prefixed with sys. - more on schemas vs. owner later) and have a dm_ prefix.

Just listing the DMVs with a one line'r would make this a very boring article. I'd rather show you which ones I think are cool and will be useful to use. However, there are so many that it will take some time for all the buried treasure to be found. Broadly speaking the DMVs are split along functional lines. There are DMVs for CLR-integration, I/O, replication, full-text, and many others. Each of these categories has its own prefix. For example, CLR related DMVs all begin with sys.dm_clr_*. To be able to view DMVs does not require sysadmin privileges. The new VIEW SERVER STATE permission is all the is required for most DMVs (i.e. can look but cannot touch). I know some people that will be very happy to hear this!

Whenever I first get my hands on a SQL Server, the first think I like to do is have a look at what processes are running on the SQL Server. In SQL Server 2000 this would be sysprocesses (albeit through sp_who/2). This view still exists in SQL Server 2005 for backwards compatibility.

There isn't just a single replacement DMV for sysprocesses. There are at least 3! sysprocesses was always a bit confused in that it mixed connection, session and request (what was being executed) into a single table. SQL Server 2005 has split this information up into 3 separate DMVs. This DMVs belong in the "Execution Related Dynamic Views" category and carry the sys.dm_exec_* prefix:

  • sys.dm_exec_sessions
  • sys.dm_exec_connections
  • sys.dm_exec_request

To call these DMVs a replacement for sysprocesses is doing a disservice to the DMVs. It would be like saying that the Ford GT90 is a replacement for the Focus! It is so much more, in a different league!

First, let's have a look at sys.dm_exec_sessions as this has the closest resemblance to sysprocesses. One row exists in this view for every connection / session to SQL Server. Run the following query to see what gets returned:

select *
from sys.dm_exec_sessions;

At first glance this looks pretty much the same as sysprocesses, but keep scrolling to the right. You'll find that each session's user / environment options are clearly visible across all connections!

ANSI_NULLS, DATEFORMAT, etc. are all there! In one foul swoop you can now check that all the sessions to the database are using the correct connection properties. Previously this information was only available by using the undocumented DBCC PSS command. Even then, no aggregated view like this DMV was available. Even better, the transaction isolation level is now visible as well. I don't know how many times I've had to convince a developer that their transaction actually gets upgraded to SERIALIZABLE when started under COM+! With this column, it will be the easiest thing in the world to check! Also observe the row count column (how many rows were returned in the last result set returned), previous error number and last request start / end times - all priceless when diagnosing issues.

Depending on how familiar you were with sysprocesses you'll have noticed sys.dm_exec_sessions did not contain all of the columns from sysprocesses, particularly the connection oriented columns. The DMV containing connection specific columns is sys.dm_exec_connections. Run the following query to get a flavour of what the table contains:

select *
from sys.dm_exec_connections;

You'll notice that this table contains a smaller number of rows than sys.dm_exec_sessions. This is because sys.dm_exec_sessions includes system sessions as well (such as the lazy writer, service broker, etc.).

The sys.dm_exec_connections DMV only contains local / remote connections to the SQL Server. This does not include any internal system connections (which aren't real connection anyway). Again, this DMV contains far more detail than sysprocesses. Amongst the new columns we can now see the packet size each connection is using, number of bytes that have been sent / received for each connection (see how the number of ways to spot the resource hogs has increased with DMVs?) and which authentication scheme the connection is using. Anyone that tried to figure whether a connection was using NTLM or Kerberos in SQL Server 2000 will love that!

The last DMV for this instalment is sys.dm_exec_requests. Again, ping the DMV to see what comes back:

select *
from sys.dm_exec_requests;

We can use this DMV to get the overall picture of what requests are currently executing, get handle to the TSQL and plan they are executing, see if the request is blocked / blocking (same as SQL Server 2000). Note that this DMV duplicates some of the session information from sys.dm_exec_sessions so you don't have to join the DMVs. Typically you'd want to see session- and request-oriented data together.

So, you might be wondering, given the amount of detail available in DMV's is there a way to be able to dig into them easily without having to come up with sp_who3...? Luckily Microsoft has integrated portions of Reporting Services into the Query Analyser replacement: SQL Server Management Studio (SSMS). Bring up the Reports window and select one of the many available. Context of the report is driven by the current selection in Object Explorer. For those that don't have ready access to SSMS have a look at the attached screen shot and be wow'd!

Now that we've had a bit of an introduction to DMVs as a transition from sysprocesses, next we'll have a look at the power user's DMVs!

You can download the SQL Server 2005 system view map (which includes DMVs) from the Microsoft Download Center.

Labels: ,

Monday, 29 May 2006

Snapshot Isolation Part II

posted by Stefan Delmarco

Previously we started exploring the new row versioning / snapshot isolation features in SQL Server 2005. We saw how writers no longer block readers when the READ_COMMITTED_SNAPSHOT database option has been set. The rule of thumb is that read committed snapshot will be used when a shared (read) lock is being requested. Exclusive or update locks will never use row versioning. Depending on the logic of the TSQL, sometimes it is necessary to drop out of snapshot isolation and ensure that the row being read is current. However, bear in mind that any code which forces SQL Server to serialise access to a resource will become a contention hotspot.

In this instalment we'll continue our investigation into snapshot isolation and dig into its more advanced usage in the form of the snapshot transaction isolation level (i.e. read committed snapshot versus snapshot transaction isolation) . Whereas read committed snapshot was very specific in terms of only applying to shared locks, snapshot transaction isolation places a context around the entire transaction, affecting all locks. With read committed snapshot the latest version of a row was read by all connections as soon as the owning transaction committed. When a transaction utilising snapshot transaction isolation level begins, SQL Server notes when the transaction began relative to all other transactions executing in the database at that point-in-time. SQL Server uses this marker to determine which version of a row the transaction will read for its entire duration. This behaviour is best demonstrated with an example.

The following TSQL creates a database, enables snapshot isolation level, creates a table and adds some data:

create database DiscoverYukon;
go
alter database DiscoverYukon
set allow_snapshot_isolation on;
go

use DiscoverYukon;
create table someTable (
    id int not null primary key clustered,
    someData varchar(100));
go

insert someTable (id, someData)
select 1, 'One'
union all
select 2, 'Two'
union all
select 3, 'Three';
go

Again, we'll need two separate transactions. On connection #1 we start a transaction and update the table:

-- Connection #1...
use DiscoverYukon;
go

set transaction isolation level snapshot;
begin transaction;

update someTable
set someData = 'Three.One'
where id = 3;
go

Note the most obvious difference is the presence of SET TRANSACTION ISOLATION LEVEL. This option has always been present in SQL Server. It governs how transaction locks will be treated for this connection. However, previously the only possible values were READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ and SERIALIZABLE. SNAPSHOT is new to SQL Server 2005. Setting the transaction isolation to snapshot has to be done explicitly, either on the connection or in the TSQL text . On connection #2 we now try read the row that was updated in connection #1. On connection #2 execute:

-- Connection #2...
use DiscoverYukon;
go

set transaction isolation level snapshot;
begin transaction;

select *
from someTable
where id = 3;

id someData
---- -----------
3 Three

As you probably guessed, this returns the previous version of the row, before connection #1 updated it. However, now let's try and update the row from connection #2.

-- Connection #2...
update someTable
set someData = 'Three.Two'
where id = 3;

Same as before, this update blocks as SQL Server will ensure that we only update the current version of a row. Let's commit the transaction on connection #1:

-- Connection #1...
commit transaction;

The query on connection #2 now completes with the following error:

Msg 3960, Level 16, State 2, Line 2 Snapshot isolation transaction aborted due to update conflict. You cannot use snapshot isolation to access table 'dbo.someTable' directly or indirectly in database 'DiscoverYukon' to update, delete, or insert the row that has been modified or deleted by another transaction. Retry the transaction or change the isolation level for the update/delete statement.

We got this error because when connection #2's transaction began, SQL Server noted that the row was already being updated by connection #1. With this knowledge, SQL Server ensured that connection #2 would not be allowed to modify any data that was being versioned when connection #2's transaction began. The reason for this behaviour will become apparent if we re-run the sample again. On connection #1 execute:

-- Connection #1...
set transaction isolation level snapshot;
begin transaction;

update someTable
set someData = 'Two.One'
where id = 2;

On connection #2 execute:

-- Connection #2...
set transaction isolation level snapshot;
begin transaction;

select *
from someTable
where id = 2;

id someData
---- -----------
2 Two

Again, this returns the versioned row as it was when connection #1 began its transaction (the update from the first sample). Now let's commit connection #1's transaction and re-read this row:

-- Connection #1...
commit transaction;

And on connection #2:

-- Connection #2...
select *
from someTable
where id = 2;

id someData
---- -----------
2 Two

Create another separate connection #3 and run the same SELECT:

-- Connection #3...
use DiscoverYukon;
go

select *
from someTable
where id = 2;

id someData
---- -----------
2 Two.One

Now it should become apparent what's going on. Connection #2's SELECT returns the same versioned row even though connection #1 has committed its transaction. The latest version of the row has someData set to 'Two.One' as can also be witnessed by the data returned by connection #3. SQL Server will ensure that any rows that are versioned when a transaction begins will only be available as versioned rows for the duration of that transaction. The transaction is essentially forbidden to update any rows that were under the influence of another transaction when the transaction began. This also includes not being able to modify any rows that are changed by another transaction after the transaction began.

For OLTP systems I can see read committed snapshot playing a far bigger role than snapshot isolation due to the update conflict complication with snapshot transaction isolation level. However, snapshot isolation will have a role to play when the probability of causing conflicts is low. For example, if a query needs to generate a report and must ensure that the data was consistent, at a point in time, when the report was run, snapshot transaction isolation level will be beneficial (the report does not update any data). In this manner it can act as a very lightweight alternative to REPEATABLE READ or SERIALIZABLE transaction isolation levels which typically pay a very high concurrency price. An acceptable pattern to follow when using snapshot isolation is to automatically retry in the event of an update conflict. Many large database systems already have data access layers that retry in the event of a deadlock.

As a taster for the next instalment, here is a very useful query using a DMV. It returns how long a row has been versioned for. Big numbers mean long running transactions (which are always bad!). Start a couple of transaction using either of the two snapshot isolation levels and observe the output. Use this to help find the long running culprits!

select session_id as [spid],
    elapsed_time_seconds
from sys.dm_tran_active_snapshot_database_transactions
order by 2 desc;

Labels: ,

Tuesday, 9 May 2006

Snapshot Isolation Part I

posted by Stefan Delmarco

Our investigation into snapshot isolation will be split into a number of parts. We'll firstly introduce the basic concept of what snapshot isolation means and how it can be used in its most basic form. Then we'll have a look at the advanced use of snapshot isolation as a full-blown transaction isolation level.

At its core, the SQL Server engine shares many similarities with a source code version control system. In fact, many of the terms used in snapshot isolation level have direct counterparts with applications like Rational's ClearCase version control system. In ClearCase, every file in the repository has an associated version number. When developers wants to make changes to a file, they first need to 'check it out'. This prevents any other developer from being able to make any changes to that file while it is being modified. In reserved mode no other developer can check the file out till the file has been checked back in. This is the same behaviour as SQL Server 2000 and SQL Server 2005 without snapshot isolation. When a row is being inserted / updated / deleted, the row is essentially 'checked out' till the transaction that performed the 'check out' either commits or rolls back. No other connection can even look at the row till it is 'checked back in'. To change this behaviour TSQL allows you specify locking hints such as NOLOCK (look at the row anyway, even though the data may only be partly changed - be very wary of using this) or READPAST (pretend that the row isn't even there).

Back to the ClearCase analogy. Even though a file has been checked out, developers can still look at the previous version of a file. For example, if HelloWorld.cs was checked out at version 3, developers can still look at versions 1 through 2. Version 3 only becomes visible when the developer has checked HelloWorld.cs back in. Well, this is precisely what snapshot isolation does. SQL Server maintains a version history of all rows that are in the process of being created / changed / removed. Does this mean that you can have a look at all previous versions of a row since it was created? No. SQL Server only maintains the last version of a row while it is being modified. Where are these previous versions stored? Is my database size going to multiply? No, previous version of a row are transient and are stored in TempDb.

All that snapshot isolation is trying to do is to prevent readers from blocking writers. That is, if a reader is trying to SELECT a row that is currently being altered by another writer, SQL Server will return the last version of the row that was committed instead of forcing the reader to wait till the writer commits. No more blocking...most of the time anyway(read on!).

As mentioned earlier, there are two modes of operation with snapshot isolation. We'll be looking the most basic mode of operation, known as READ_COMMITTED_SNAPSHOT. All that we need to know at this stage is that this behaviour applies when the transaction is running under the READ COMMITTED transaction isolation level (as opposed to READ UNCOMMITTED, REPEATABLE READ or SERIALIZABLE). This mode of behaviour is best observed with an example.

create database DiscoverYukon;
go
use DiscoverYukon;
go

-- Enable snapshot isolation for read committed.
alter database DiscoverYukon
set read_committed_snapshot on;
go

-- Create a test table.
create table someTable (
    Id int not null primary key clustered,
    SomeData varchar(255));
go

READ_COMMITTED_SNAPSHOT needs to be enabled at the database level. The default in the Model database (the template used whenever creating a new database) has READ_COMMITTED_SNAPSHOT set to OFF.

We'll dig into snapshot isolation using two connections. On the first connection, execute the following TSQL to insert a new record into the test table. We wrap this in a transaction so that we can interact with it from another connection:

-- Connection #1...
begin transaction
insert someTable (Id, Somedata)
values (1, 'FirstVersion');

On the other connection we simulate a user trying to read the same row:

-- Connection #2...
select * from someTable;

This query immediately comes back with no records! In SQL Server 2000 the SELECT would have blocked till connection #1 committed (try it out!). But, what if connection #2 tries to insert a record with the same Id. What if we added some 'insert Id 1 if it does not exist'-type logic?

-- Connection #2...
if not exists (
    select top 1 1
    from someTable
    where id = 1) begin

    insert someTable (Id, Somedata)
    values (1, 'FirstVersion');
end

When running this query you'll notice that it has reverted back to the SQL Server 2000 behaviour. It blocks till connection #1 commits. However, when connection #1 commits something that may be unexpected happens. Connection #2's query fails with the following error:

Msg 2627, Level 14, State 1, Line 6
Violation of PRIMARY KEY constraint 'PK__someTable__7C8480AE'. Cannot insert duplicate key in object 'dbo.someTable'.
The statement has been terminated.

Connection #2 was not blocking on the SELECT statement! That completed successfully. It was the INSERT that was blocking. SQL Server was not allowing a previous version of the row to be changed (in this case a row with the same Id to be created) even though snapshot isolation was enabled. What you read is not always what you change when snapshot isolation is in use. Therefore, care must be taken when trying to write this type of query. In this instance, the SELECT needs to use the new locking hint available in SQL Server 2005, namely READCOMMITTEDLOCK. This is not very intuitive as the existing lock hint is READCOMMITTED. The difference between these two is that READCOMMITTED will still honour the READ_COMMITTED_SNAPSHOT setting while READCOMMITTEDLOCK will not. READCOMMITTEDLOCK will not read previous versions of the row and will wait till it obtains a read lock on the actual row (it is oblivious to previous versions). Therefore the TSQL needs to be rewritten as:

-- Connection #2...
if not exists (
    select top 1 1
    from someTable with(readcommittedlock)
    where id = 1) begin

    insert someTable (Id, Somedata)
    values (1, 'FirstVersion');
end

In this case connection #2 will block on the SELECT till connection #1 commits. When it does, connection #2 completes normally and does not attempt to insert a duplicate row (the IF returns FALSE). Next we'll look at how UPDATEs are affected by snapshot isolation.

Labels: ,

Friday, 21 April 2006

TRY...CATCH

posted by Stefan Delmarco

I'll start off by saying that I am not a fan of the error handling practises that this syntax promotes. People are often surprised when I tell them that, in my opinion, BEGIN TRAN, ROLLBACK TRAN, SELECT @@ERROR, etc. do not belong in stored procedures. Think of it this way, have you ever seen any code in a stored procedure that actually does anything useful when an unexpected error occurs, besides exiting the stored procedure? Then why do we have to have the standard 10 lines of boiler plate 'error handling' TSQL that:

  • tests @@ERROR after every DML statement,
  • does a GOTO to an ErrorHandler label, and
  • does a RAISERROR of 'an unexpected error occurred'?

If the intention is just to stop when an error occurs then the SET XACT_ABORT statement is all that is needed at the start of the batch / connection. In addition, if you're using .NET then you have the SqlException class available. This exception class (which gets thrown by the SqlConnection class when a warning / error occurs) includes the name of the procedure and the line number the error occurred on. What more do you need? If you need additional context, like a customer ID, you probably know that already in the middle-tier / client code that is calling the stored procedure. Alternatively, the source of the error is a business rule that has failed and you're performing a custom RAISERROR anyway and are able to substitute values, like the customer ID, into the error message.

Ultimately, the entity in charge of the transaction must be the arbiter that decides whether the transaction needs to commit or rollback. This has to be the middle-tier code that is calling the stored procedure and not the stored procedure itself. The only situation I can think of where transaction control logic needs to reside in TSQL is when the entity in charge is Query Analyser / SQL Server Management Studio (SSMS), i.e. you're running SQL scripts and don't have the luxury of C# available and you need some actions to be performed if an error occurs. Likewise a stored procedure that will act as a Service Broker internal service program could benefit from TRY...CATCH as the stored procedure is the initiator.

Anyway, enough of my opinions. I'll be covering recommended error handling practises in a later instalment. Let's look at what TRY...CATCH brings to the table.

As you've probably guessed, TRY..CATCH is a TSQL implementation of the exception handling semantics we've grown to love in .NET / C++. A TRY block determines the scope of the TSQL statements that will be monitored for RAISERRORs (caused by either SQL Server or custom RAISERRORs). If an error is raised the TRY block is immediately exited and control is passed to the corresponding CATCH block. TRY..CATCH blocks can be nested in the same manner as C#. However, there is no concept of a FINALLY block. The following TSQL is an example of a simple TRY...CATCH block.

begin try
    -- Fails with a primary key violation
    insert Sales.Currency (CurrencyCode, [Name])
    values (N'GBP', N'United Kingdom Pound')

    print N'Will not be reached...'
end try
begin catch

    print N'In catch block...'
end catch

Running this in SSMS produces the following output:

In catch block...

The good thing about this behaviour is that we no longer need to include the standard @@ERROR test after any DML. In fact, if the code is within a TRY block we'll never get a chance to test @@ERROR anyway. 'Will not be reached...' is never printed. Also note that SSMS did not report any error. What? Where has my PK violation gone? To ensure that I wasn't seeing a quirk of SSMS I ran the following in a C# console application:

try
{
    using(SqlConnection con = new SqlConnection("Data Source=(local);Initial Catalog=AdventureWorks;Integrated Security=SSPI"))
    using SqlCommand cmd = con.CreateCommand())
    {
        con.Open();
        cmd.CommandType = CommandType.Text;
        cmd.CommandText = @"
begin try
    -- Fails with a primary key violation
    insert Sales.Currency (CurrencyCode, Name)
    values (N'GBP', N'United Kingdom Pound')

    print N'Will not be reached'
end try
begin catch
    print N'In catch block'
end catch"
;
        cmd.ExecuteNonQuery();
        Console.WriteLine("It all worked perfectly...");
    }
}
catch(Exception exception)
{
    Console.WriteLine(exception);
}

Again, no exception. So the TSQL CATCH is behaving exactly like exceptions in .NET. If you catch an exception in TSQL and don't re-raise the exception, the caller will never know that it occurred. Aaargh! Why would you ever want that behaviour in a middle-tier application? The golden rule with exception handling is: "Don't catch exceptions unless you can do something about them". When could you ever do something about an error in TSQL besides aborting the batch?

Anyway, if you ever decide to use TRY..CATCH in TSQL there are a couple of new functions that are available to inspect the exception, think of them as @@ERROR replacements. You now have ERROR_LINE(), ERROR_MESSAGE(), ERROR_NUMBER(), ERROR_PROCEDURE(), ERROR_SEVERITY(), ERROR_STATE() available inside the catch block. In addition, there is another function called XACT_STATE() available that tells you whether or not the transaction is capable of being committed. BOL states that if XACT_STATE returns -1 then an error has occurred in the active transaction that has rendered it uncommittable. In this state only read operations are allowed. A transaction will enter this state if XACT_ABORT has been set on and an error was raised.

I can see how TRY..CATCH will be useful if you are running pure TSQL scripts. However, for any database interactions that are initiated from application servers / clients I don't believe they should be used at all. One of the holy grails I rely on is that, in SQL Server 2000, no errors can ever be swallowed / suppressed in the database. The client always has full knowledge of all errors that occurred, even if there were multiple RAISERRORs (SQL Server initiated and / or custom errors - SqlException has a SqlErrors collection). Use of TRY..CATCH in these cases could hide these errors from the application server / client resulting in mysterious missing data / errors that the caller has no knowledge of. I feel that the use of TRY..CATCH would be on par with the evil that is catch(...) in C++!

I had a colleague suggest that as the new TRY...CATCH syntax can catch deadlocks (which are not possible to handle within TSQL in SQL Server 2000), you could move the deadlock retry code from the application servers into the database. Please don't do this unless you want applications that are randomly slow with transactions boundaries as fuzzy as Beyonce's afro!

Next up, SNAPSHOT_ISOLATION level for all the Oracle pundits.

Labels: ,

Thursday, 6 April 2006

The OVER Clause Part II

posted by Stefan Delmarco

The only mention that the OVER clause can be used with aggregate functions (SUM, COUNT, etc.) is in the following statement in BOL (Books Online):

The OVER Clause

It gives no hint or inkling of the wealth of possibilities this opens up for queries that need to contain a mix aggregate and non-aggregate results. Suppose you were asked to write a query that will return a result set of all customers' FirstName, LastName and EmailAddress. For each customer, return a count of how many other customers have the same FirstName and a count of how many have the same LastName. Restrict this to customers whose LastName begins with a 'D' to keep it short. In SQL Server 2000 you'd have to start thinking of sub-selects and correlated sub-queries to mix these aggregate (one grouping on FirstName the other on LastName) and non-aggregate (each customer row needs to be returned independent of any aggregate grouping) requirements. If the requirement was for a pure aggregation (e.g. count of customers grouping by LastName) it would be very straight forward. However, with the mix you'd probably come up with something that looks like one of these:

-- #1, Expensive...
select FirstName, LastName, EmailAddress,
    (select count(*)
    from Person.Contact Cinner
    where Cinner.FirstName = Couter.FirstName
     and Cinner.LastName like 'D%') as [CountOfFirstName],
    (select count(*)
    from Person.Contact Cinner
    where Cinner.LastName = Couter.LastName
     and Cinner.LastName like 'D%') as [CountOfLastName]
from Person.Contact Couter
where LastName like 'D%'
order by LastName, FirstName

-- #2 Better...
select FirstName, LastName, EmailAddress, CountOfFirstName, CountOfLastName
from Person.Contact C
    inner join
        (select FirstName as [FirstNameCounted],
            count(*) as [CountOfFirstName]
        from Person.Contact
        where LastName like 'D%'
        group by FirstName) as [CountFirstNames]
    on C.FirstName = CountFirstNames.FirstNameCounted
    inner join
        (select LastName as [LastNameCounted],
            count(*) as [CountOfLastName]
        from Person.Contact
        where LastName like 'D%'
        group by LastName) as [CountLastNames]
    on C.LastName = CountLastNames.LastNameCounted
where LastName like 'D%'
order by LastName, FirstName

#1 is twice as expensive as #2. Execute these queries and have a look at the execution plans to understand why (we'll covering analysis of execution plans in the future). Both these queries produce the following result set (abbreviated):

FirstName LastName EmailAddress CountOfFirstName CountOfLastName
--------- --------- ------------------------------ ----------------- ---------------
...
David Daniels david34@adventure-works.com 87 2
David Daniels david31@adventure-works.com 87 2
Ryan Danner ryan3@adventure-works.com 60 2
Ryan Danner ryan5@adventure-works.com 60 2
Mike Danseglio mike2@adventure-works.com 9 1
Alexander Davis alexander7@adventure-works.com 23 77
Alexandra Davis alexandra68@adventure-works.com 93 77
Alyssa Davis alyssa4@adventure-works.com 67 77
...

As a quick sanity check we can see that anyone with the surname Davis has CountOfLastName as 77 and anyone with the FirstName David has CountOfFirstName as 87.

In the same manner the OVER clause applies to the window ranking function that immediately precedes it, the OVER clause affects the aggregate function that precedes it. In addition, if there are multiple aggregate functions, each one can have its own OVER clause that can be different from the others. However, as per the documentation snippet from above, aggregate functions only include the PARTITION BY clause in their OVER clause. Unlike ranking functions, there is no concept of an ORDER BY for an aggregate function (this wouldn't have any meaning to an aggregation anyway). The same query can now be rewritten in SQL Server 2005 as follows:

select FirstName, LastName, EmailAddress,
    count(*) over(partition by FirstName) as CountOfFirstName,
    count(*) over(partition by LastName) as CountOfLastName
from Person.Contact
where LastName like 'D%'
order by LastName, FirstName

How cool is that?! This rocks! Not only does it read much better than the queries above, it is the cheapest to execute by a long long shot (65% cheaper than #1 and 45% cheaper than #2 - have a look at the execution plans). This query produces the identical results to the queries above. Note that there is not a GROUP BY to be seen. SQL Server has essentially substituted the GROUP BY that all aggregate functions need with the PARTITION BY clause. As SQL Server is producing the result set, it is streaming the rows through the aggregate functions. The PARTITION BY tells SQL Server how to group the rows when applying the aggregation function in the same manner the PARTITION BY was telling SQL Server when to restart the ranking function.

Note that if you need to write a query that produces a pure aggregation then the normal GROUP BY form needs to be used as you'll want the source result set collapsed to their distinct values that the aggregate functions iterated over. If you need to mix aggregate and non-aggregate functions then OVER with PARTITION BY is the best option. Next up is the new TRY...CATCH syntax.

Labels: ,

Monday, 20 March 2006

The OVER Clause Part I

posted by Stefan Delmarco

The OVER clause is new to SQL Server 2005. It is used for aggregate (SUM, AVG, COUNT, etc.) and ranking (NTILE, ROW_NUMBER, RANK, etc.) functions. Previously we've seen it used to specify the sorting SQL Server needs to apply to a result set before evaluating its associated window function for each row. For example, the following query results a list of all vendors, their post codes and cities. Each row is assigned a ROW_NUMBER according to its absolute position when the rows are ordered according to their PostalCodes.

select row_number() over(order by PostalCode) as [RowNumber],
    [Name], PostalCode, City
from Person.Address A
    inner join Purchasing.VendorAddress VA
        on A.AddressId = VA.AddressId
    inner join Purchasing.Vendor V
        on VA.VendorId = V.VendorId
order by PostalCode;

RowNumber Name PostalCode City
--------- ---------------------------------- ---------- -------------
1 Premier Sport, Inc. 02113 Boston
2 A. Datum Corporation 10007 New York
3 Morgan Bike Accessories 12210 Albany
4 Wood Fitness 19107 Philadelphia
5 Competition Bike Training Systems 30308 Atlanta
6 Bike Satellite Inc. 37501 Downey
7 Midwest Sport, Inc. 48226 Detroit
...

Now, suppose I didn't want the row number to be consecutive over the entire result set. For example, I may want to have the rows numbered from 1 to n for each city. That is, I want the row numbers to be consecutive (1,2,3...n) per city. This is what the PARTITION BY clause allows us to do. By including a PARTITION BY clause in the window function's OVER clause we can essentially restart the function when the value of the column specified in the PARTITION BY changes. For example, to number the rows consecutively per city the query changes to:

select row_number() over(partition by City order by PostalCode) as [RowNumber],
    [Name], PostalCode, City
from Person.Address A
    inner join Purchasing.VendorAddress VA
        on A.AddressId = VA.AddressId
    inner join Purchasing.Vendor V
        on VA.VendorId = V.VendorId
order by City;
RowNumber Name PostalCode City
--------- --------------------------------- ----------- -----------
1 Morgan Bike Accessories 12210 Albany
2 Leaf River Terrain 97321 Albany
1 Allenson Cycles 91001 Altadena
2 Gardner Touring Cycles 91001 Altadena
1 Speed Corporation 98221 Anacortes
2 Electronic Bike Co. 98221 Anacortes
3 Northern Bike Travel 98221 Anacortes
1 Competition Bike Training Systems 30308 Atlanta
...

As you can see, the ROW_NUMBER function has restarted its numbering as soon as the City changed. The PARTITION BY clause can contain any number of columns available in the tables in the FROM clause. As PARTITION BY defines its arguments as an expression, we can start getting creative in how we perform the partitioning. For example, the following query partitions the row number according to the second letter of the address:

select row_number() over(partition by substring(City, 2, 1) order by PostalCode) as [RowNumber],
    [Name], PostalCode, City
from Person.Address A
    inner join Purchasing.VendorAddress VA
        on A.AddressId = VA.AddressId
    inner join Purchasing.Vendor V
        on VA.VendorId = V.VendorId
order by substring(City, 2, 1);

RowNumber Name PostalCode City
--------- --------------------------------- ----------- -----------
...
15 Bicycle Specialists 97034 Lake Oswego
16 Vista Road Bikes 97301 Salem
17 Knopfler Cycles 97301 Salem
18 Recreation Place 97301 Salem
19 Reliance Fitness, Inc. 98107 Ballard
20 Electronic Bike Repair & Supplies 98403 Tacoma
21 Cycling Master 99362 Walla Walla
1 A. Datum Corporation 10007 New York
2 Midwest Sport, Inc. 48226 Detroit
3 Green Lake Bike Company 80203 Denver
4 Beaumont Bikes 83301 West Covina
5 Greenwood Athletic Company 85252 Lemon Grove
...

ROW_NUMBER restarted its numbering as soon as the second letter of the City name changed from 'a' to 'e'. All of the above applies equally to RANK / DENSE_RANK and NTILE. The OVER clause can also be used with aggregate functions (like MIN, MAX, AVG, SUM, etc) which makes mixing aggregations and non-aggregations in a single query a lot easier than before. That's what we'll be looking at next.

Labels: ,

Thursday, 9 March 2006

The RANK and DENSE_RANK Ranking Functions

posted by Stefan Delmarco

RANK and DENSE_RANK are the last of the new ranking (window) functions. The best analogy of how these functions work is to think of how you'd solve the following problem: You've been asked to write a report that will generate a list of the best selling products. You're required to produce a result set that contains the product name, total quantity sold and assign each product a ranking (i.e. 1st, 2nd, 3rd, etc.). So you might write something like this:

select P.[Name],
    sum(TH.Quantity) as [TotalQuantity],
    row_number() over(order by sum(quantity) desc) as [Ranking]
from Production.TransactionHistory TH
    inner join Production.Product P
        on TH.ProductID = P.ProductID
group by P.[Name]
order by [TotalQuantity] desc;

Name TotalQuantity Ranking
----------------- ------------- --------------------
BB Ball Bearing 374090 1
Seat Stays 187376 2
Blade 93688 3
Chain Stays 93688 4
Fork End 93688 5
HL Crankarm 64900 6
HL Hub 56264 7
Decal 1 56250 8
Decal 2 56250 9
LL Mountain Pedal 50922 10
...

Whilst this looks to have fulfilled the requirements, it has a very subtle deficiency. If you look closely you'll see that the top 10 products actually include a couple of ties. Blade, Chain Stays, and Fork End all come in at an equal 93,688 sold. By using ROW_NUMBER to assign the ranking we have failed to take account the possibility of ties and assigned lower rankings to products that have sold equal amounts. Imagine if this query was for assigning prize money in a golf tournament! I'm sure the competitors would not be very happy about this 'feature'!

This is where the RANK function comes into play. In the same manner as all the previous windowed functions we looked at, RANK requires an OVER clause. RANK uses the ORDER BY in the OVER clause to assign a monotonically increasing value to each row, starting from 1. However, RANK looks for duplicates in the range. Where duplicate rows are found, the same RANK value is assigned to each row. In addition, when parsing duplicates, RANK counts the number of duplicates read and assigns the next row, that has a differing value, a RANK equal to the RANK of the duplicates plus the number of duplicates read. Using RANK our query then becomes:

select P.[Name],
    sum(TH.Quantity) as [TotalQuantity],
    rank() over(order by sum(quantity) desc) as [Ranking]
from Production.TransactionHistory TH
    inner join Production.Product P
        on TH.ProductID = P.ProductID
group by P.[Name]
order by [TotalQuantity] desc;

Name TotalQuantity Ranking
----------------- ------------- --------------------
BB Ball Bearing 374090 1
Seat Stays 187376 2
Blade 93688 3
Chain Stays 93688 3
Fork End 93688 3
HL Crankarm 64900 6
HL Hub 56264 7
Decal 1 56250 8
Decal 2 56250 8
LL Mountain Pedal 50922 10
...

As you can see, RANK has correctly identified the 'ties' and assigned Blade, Chain Stays and Fork End an identical RANK. In addition, it has correctly assigned HL Crankarm a RANK of sixth as we have 3 products that came in third.

DENSE_RANK behaves identical to RANK except that it does not increment the rank each time a duplicate is read. Using RANK we have no fourth or fifth placed products as the 3 third placed products have hidden the fourth and fifth placed ranks (i.e. we have non-consecutive ranks). DENSE_RANK does not skip ranks in this manner. Changing our query to use DENSE_RANK becomes:

select P.[Name],
    sum(TH.Quantity) as [TotalQuantity],
    dense_rank() over(order by sum(quantity) desc) as [Ranking]
from Production.TransactionHistory TH
    inner join Production.Product P
        on TH.ProductID = P.ProductID
group by P.Name
order by [TotalQuantity] desc;

Name TotalQuantity Ranking
------------------ ------------- --------------------
BB Ball Bearing 374090 1
Seat Stays 187376 2
Blade 93688 3
Chain Stays 93688 3
Fork End 93688 3
HL Crankarm 64900 4
HL Hub 56264 5
Decal 1 56250 6
Decal 2 56250 6
LL Mountain Pedal 50922 7
...

Next we'll explore the OVER clause in a bit more detail to see how partitions work.

Labels: ,

Saturday, 18 February 2006

The ROW_NUMBER Ranking Function

posted by Stefan Delmarco

Given the simplicity of this function it is surprising that we've never had it before! All that this function does is label each row with a monotonically increasing row number. Think of it as a dynamic IDENTITY column.

This function falls within the new ranking functions we started covering in the previous article. Just like NTILE, the ROW_NUMBER function needs an OVER clause that contains an ORDER BY. Each row is given a number according to its absolute position in the order specified in the OVER clause.

For example, the following query returns all First / LastNames where the LastName begins with the letter 'D'. The first row will have a RowNumber value of 1, the next 2, and so forth:

select row_number() over(order by LastName) as [RowNumber],
    FirstName, LastName
from Person.Contact
where LastName like 'D%';

Whilst this may seem very straight forward, the real intention of this function is to make it easy to implement paging in the database. That is, the ability to return a portion of the result set as specified by the number of the first row to return, and a page size (a'la Google-style). Just like the NTILE function, the ROW_NUMBER function cannot be used in the WHERE clause. To filter on ROW_NUMBER a derived table needs to be used:

declare @startRow int;
declare @pageSize int;
set @startRow = 91;
set @pageSize = 10;

select FirstName, LastName
from (
    select row_number() over(order by LastName) as [RowNumber],
        FirstName, LastName
    from Person.Contact
    where LastName like 'D%') as Derived
where RowNumber between @startRow
and @startRow + @pageSize - 1;

As a test I wanted to see how clever the query optimizer would be at restricting the number of rows returned. Typically you page data over an enormous results set (think of the contents of a telephone directory). I wanted to make sure that the query optimizer didn't assign a ROW_NUMBER to every single row in the underlying result set that is being paged. The first time I ran the query I found that the execution plan first performed a clustered index scan of the entire Person.Conact table to find all LastNames that begin with D (there are 556 LastNames that begin with 'D'):

Execute Plan Before

This is very inefficient as the optimizer has to chew through 556 rows just to return 10! However, once I added a non-clustered index on the LastName column, the change in the query plan was astounding!

create nonclustered index IX_Contact_LastName
on Person.Contact (LastName asc)

Instead of blindly following the sub-select derived table filter that the first execution plan produced, the revised execution plan consisted of two distinct parts. The first part performed an index seek (using IX_Contact_LastName) to find the first 100 rows in the Person.Contact table that began with the letter 'D'. 100...? Well, we specified a @startRow of 91 and @pageSize of 10. So SQL Server is searching for the last row in the result set where the LastName begins with 'D', ordered by LastName (as we specified in the OVER clause).

Execution Plan After

The second part of the query is where SQL Server joins the rows that fall within our criteria (from part 1) back to the table (part 2 which uses the clustered PK_Contact_ContactID index) to retrieve the FirstNames for the LastNames we are returning. This is brilliant! SQL Server has understood what the query is trying to perform holistically. It has restricted the number of rows to evaluate to the last row the query is interested in instead of the entire result set! Under the skin the query optimiser has implemented a very clever TOP 100 clause.

So, the ROW_NUMBER function is extremely easy to use. However, care still needs to be taken when deciding how to order and filter the rows to be returned, as a simple query does not necessarily equate to an efficient execution plan. Just because you can page easily don't forget to make sure you have the underlying database structures to perform that paging efficiently. The hard-learned rules of query optimisation through indexing are still very relevant.

Labels: ,

Tuesday, 7 February 2006

The INTERSECT and EXCEPT Operators

posted by Stefan Delmarco

The INTERSECT and EXCEPT operators are extensions to the TSQL language that implement left semi-join and left anti semi-joins respectively. This is not a new capability in the sense that these semi-joins were not possible before. They are merely an alternatively to expressing a set-based operation using an alternative syntax.

The best way to visualise the new INTERSECT and EXCEPT operators is to see them as extensions to the UNION (ALL) operator. The UNION operator allows you to vertically join the results of multiple SELECT statements into a single result set. For example (using the new AdventureWorks database), the following query returns all FirstNames where the LastName starts with 'A' concatenated with all FirstNames where the LastName starts with 'D':

select FirstName
from Person.Contact
where LastName like 'A%'
    union all
select FirstName
from Person.Contact
where LastName like 'D%';

This is a bit of a trivial example as you could have used an OR in a single WHERE clause but it illustrates the concept. Whereas UNION simply concatenates the two result sets (the ALL keyword prevents duplicates from being eliminated), INTERSECT and EXCEPT will only return distinct rows that either exist in both result sets or in only in the first (also known as the left) one. For example, the following query returns all FirstNames that are used by LastNames that begin with 'A' and by LastNames that begin with 'D'. In the example, Aaron is returned in the result set as there is an Aaron Alexander, Aaron Adams, Aaron Allen and Aaron Diaz.

select FirstName
from Person.Contact
where LastName like 'A%'
    intersect
select FirstName
from Person.Contact
where LastName like 'D%';

The equivalent in SQL Server 2000 would have looked like this:

select distinct PC1.FirstName
from Person.Contact PC1
    inner join Person.Contact PC2
        on PC1.FirstName = PC2.FirstName
where PC1.LastName like 'A%'
 and PC2.LastName like 'D%';

Also, note that the INTERSECT operator is symmetric in the sense that the same result set is produced if the queries are interchanged. The INTERSECT query below produces the same result set as the INTERSECT query above:

select distinct FirstName
from Person.Contact
where LastName like 'D%'
    intersect
select FirstName
from Person.Contact
where LastName like 'A%';

I'm sure you're already guessing what the EXCEPT operator does. However, before you jump to the conclusion that EXCEPT does the opposite of INTERSECT, read the fine print! The EXCEPT operator is a left anti-semi join. This means that it will return rows that exist in the first result set but not in the second. Note that will not include rows that appear in the second result set but not in the first.

The following query returns the FirstNames that belong to LastNames that begin with 'A' but do not belong to LastNames that begin with 'D'. In the example 'Heidi' is returned as there is only a Heidi Arun. There is no Heidi that has a LastName that begins with the letter 'D'.

select FirstName
from Person.Contact
where LastName like 'A%'
    except
select FirstName
from Person.Contact
where LastName like 'D%';

The equivalent in SQL Server 2000 is as follows:

select distinct PC1.FirstName
from Person.Contact PC1
    left outer join Person.Contact PC2
        on PC1.FirstName = PC2.FirstName
        and PC2.LastName like 'D%'
where PC1.LastName like 'A%'
 and PC2.FirstName is null;

As you can see, the equivalent for the EXCEPT operator is the classic 'find the missing rows' query (LEFT OUTER JOIN with an IS NULL on the JOIN'd column).

Whilst the equivalent query using the EXCEPT and INTERSECT operators is a bit more compact, the jury is still out in terms of the usefulness of these new INTERSECT / EXCEPT operators. Understanding the old-style syntax has taught some very valuable lessons as it requires a developer to comprehend the difference between INNER and OUTER joins, the difference between specifying filters in JOINs or WHERE clauses, and the real meaning of NULL. All of these are lessons that serious, industrial-strength developers need in order to deliver performant and scalable database implementations. I would only advocate the use of EXCEPT / INTERSECT if it produces a superior execution plan. So, if you find yourself writing a query that is attempting to correlate two result sets, give EXCEPT / INTERSECT a try!

Labels: ,

Sunday, 22 January 2006

The NTILE Ranking Function

posted by Stefan Delmarco

NTILE, RANK, DENSE_RANK and ROW_NUMBER form part of a new set of functions, called 'Ranking Functions' (or 'Ranking Window Functions') that use the new OVER operator. We'll just cover NTILE in this instalment and work our way through the rest in subsequent articles.

These functions have been added to TSQL to make features like paging and categorising data much easier. Anyone that's tried to do Google-style paging in TSQL knows that this can be a bit challenging when data sets get large. There are a couple of neat tricks in SQL Server 2000 to do this efficiently but with SQL Server 2005 this is now a walk-in-the-park.

The ranking functions all return integers whose values and ordering are determined by the context supplied to the OVER operator. The OVER clause determines the ordering the function needs to apply to the result set when it 'ranks' (i.e. evaluates) each row. Typically the OVER clause contains an ORDER BY statement but can include a PARTITION as well (which we'll ignore for now but cover in a later instalment).

Use of NTILE and OVER is best illustrated with an example. Suppose I wanted to return a result set and have the the result set split into equally-sized logical groups. That is, I want a number associated with each row that tells me which group a specific row belongs in (1..n where n is the number of group). I don't want to specify a page size for these group. I just want the rows divided equally.

declare @numberOfGroups int;
set @numberOfGroups = 5;

select ntile(@numberOfGroups) over(order by FirstName) as [GroupNumber],
    FirstName, LastName
from Person.Contact
where LastName like 'D%';

When you run this query you'll see that SQL Server has returned a number in the first column that represents the group number of that row. SQL Server has taken the total number of rows returned from the query and divided it into the specified number of groups (which could have been hardcoded) and then numbered the groups from 1 to n, where n is the @numberOfGroups. If the number of rows cannot be divided equally amongst the number of groups the lower numbered groups will have more rows in them (at most 1 more).

The OVER clause determines how SQL Server orders the rows when it divides them into groups. You can perform the grouping in one order and return the result set in another one. For example:

declare @numberOfGroups int;
set @numberOfGroups = 5;

select ntile(@numberOfGroups) over(order by FirstName) as [GroupNumber],
    FirstName, LastName
from Person.Contact
where LastName like 'D%'
order by LastName desc;

In this case the GroupNumber column will appear to be jumbled as SQL Server has assigned the rows into groups according to FirstName but then sorted the rows on LastName before returning the result set. If you just want to return a specific group number you need to use a derived table (or Common Table Expression / CTE - stay tuned). Windowed functions are not allowed to appear in the WHERE clause as this would be ambiguous (does SQL do the grouping before / after / during all the joins and filters...?)

declare @numberOfGroups int;
set @numberOfGroups = 5;
declare @groupToReturn int;
set @groupToReturn = 3;

select FirstName, LastName
from (
    select ntile(@numberOfGroups) over(order by FirstName) as [GroupNumber],
        FirstName, LastName
    from Person.Contact
    where LastName like 'D%') as Grouped
where GroupNumber = @groupToReturn;

This query just returns the third group - very simple paging / chunking. Next we'll have a look at ROW_NUMBER(), the single most requested feature after SNAPSHOT_ISOLATION!

Labels: ,

Tuesday, 17 January 2006

SQL Server 2000 Partitioned Views

posted by Stefan Delmarco

Partitioned views are a feature in SQL Server 2000 that allows an additional level of abstraction to be introduced between the orthogonal view of the data and its storage in an underlying table. In this article we explore what a partitioned view is, how it is created and accessed and, most importantly, what advantages it brings to the table.

More...

Labels: ,

Tuesday, 27 December 2005

Delivering on SQL Server

posted by Stefan Delmarco

Having been at the sharp end of many successful SQL Server solutions, I believe there are some fundamental principles that all successful SQL Server installations have in common. Whilst the challenges in delivering an enterprise-scale database-intensive application can be wide and varied, there are a few basic practises that will give you the best possible chance of success.

Play to the Strengths of the Platform

Yes, SQL Server is an RDBMS (Relation Database Management System). Yes, there are many things that all RDBMSs (Oracle, DB2, etc.) share in common. They all have tables, they all have indexes, and they all largely implement the same SQL language functionality. However, each vendor's implementation will have its own strengths and weaknesses. The key to success with your chosen RDBMS is to understand what exactly these strengths and weaknesses are.

SQL Server can either be that incredibly powerful, reliable platform on which your mission-critical applications can depend, or it can be that RDBMS your developers and DBA are always fighting to keep their applications just about running. These are both running the one-and-the same SQL Server product. The difference is that the first is being used in a manner that SQL Server was designed for. The developers understand the features available in SQL Server / TSQL and exploit them at every opportunity. They can read and interpret a query plan and know their way around Query Analyzer. In the second instance the developers are database-agnostic. They write their queries without regard for scalability, use a procedural approach for their TSQL instead of set-based, do not have SQL Server installed on their local developer workstations and have never opened up SQL Server's online help (BOL - Books Online Download / MSDN).

Just because a developer has prior experience in Access or Oracle doesn't mean that the experiences gained working with those RDBMSs translate directly to SQL Server. In fact I once had a developer ask me why we "don't just use NTEXT columns wherever we use NVARCHAR, you know, just like Memo in Access?". It is the responsibility of any developer or DBA working with SQL Server to understand how SQL Server works and what it does well, in detail.

If you don't know the difference between a clustered and non-clustered index then you need to pull up BOL and make sure you find out. Did you know that most of SQL Server's internals are actually included in the online documentation? In my opinion, BOL is probably more thorough, better written and has more extensive examples than most of the SQL Server books sold on Amazon!

As an example what it means to play to the strength of the platform let me tell you about a customer that implemented a complicated equity management system entirely in DTS! When all the data was in the right place it ran like a dream. However, when there was a problem it was the worst application in the world to try and debug. I'm pretty sure that the designers of DTS never intended it to be used in this manner (opening up one of these DTS packages was like looking at a map of the London Underground!). In this instance the customer was definitely not playing to the strengths of the platform and ended up experiencing a significant measure of pain before being forced to learn this lesson.

Embrace Microsoft Integration

I'm not talking about open-standards or interoperability here. I'm talking about not introducing plumbing complexity within an application. We all know how hard it is to connect disparate systems together, disparate sub-systems or components are orders of magnitude more troublesome as they often struggle to map cleanly into your application. Keep that kind of complexity out of your system.

Look at it this way, there is no other vendor that can give you all of the end-to-end development tools you need to build a successful solution. How much of it is ready-made (e.g. BizTalk Server) and how much of it you need to build from scratch obviously depends on the specifics of what you're delivering. This is the beauty of delivering on the Microsoft platform. Whether you're writing device drivers, an order processing system or a screen saver, the tools are the same, the documentation is the same, the support website is the same.

Don't let your design or code be constrained by trying to keep your SQL vendor agnostic. Unless you have a very specific requirement to deliver a vendor-independent database design, don't choose ANSI-compliance above TSQL-specific features. Even then, recognise and acknowledge that you will be sacrificing performance to achieve vendor independence. You may even be unable to deliver scalability in conjunction with such a restrictive requirement.

Don't build a patchwork system unless you are prepared to allocate the time and budget to play Luigi-the-plumber. Keeping it homogenous will put you in a position where you can concentrate on the design of the solution rather than having to worry if all the plumbing fixtures are going to play nicely together. Remember, any weakness in any piece of the complex jigsaw that makes up an enterprise solution determines what the lowest common-denominator is that the solution must be built on.

At worst you'll have a single vendor to deal with to sort out your problems. As an ex-support engineer it was vital to be able to reproduce a customer's problem in order to arrive at a solution. Cases that used Microsoft technologies were much easier to progress and solve than those that required inter-vendor cooperation.

At best you'll be free to concentrate on delivering business value knowing that connectedness is taken care of.

If you are building an enterprise-scale 3-tier system, keep it Microsoft. I have delivered enough successful enterprise-scale system to be able to say that with confidence!

It All Depends...

This is the answer I hate to give to some of the SQL Server design questions I am asked. Unfortunately it is the truest. There are very few hard and fast rules when it comes to which database design is appropriate for your application. Questions like "Do I need a clustered index?" or "What is the fastest way to load data into my database?" can never get a simple answer.

So why does it always have to depend? Different features of SQL Server scale differently depending on how they're being used. For example, bulk loading data into SQL Server will always be fastest initially when the table has no indexes, till it gets to a certain size. Then factors like the width of a row, the number of PFSs (Page Free Space) pages spanned by the table start to become a factor. Even then, there are no hard-and-fast rules when these other factors come into play.

So how do go about making a decision here? The simple answer is to try it out. All of the projects I have worked on that delivered successfully spent a significant time upfront, and sometimes during delivery, trying out various permutations of the design under simulated conditions to determine the feasibility, complexity and scalability of various options. That is the only way to validate the design and progress with confidence.

Keeping these types of test rigs around after a decision has been made is always useful for when service packs or hotfixes need to be tested. In addition, you never know when you may get that "Aha!" moment and want to try out something new you've just thought of!

Develop with Production-Level Data Load

Developers like to write the code only once, so you might as well make sure they develop it using a development database that contains the same data volume as is expected in production. Often this can be costly to the business as SAN disk space can get pricey. However, IDE SATA disks are now so cheap that you can pick up a 250 GB HDD for about £70. There really is no excuse...

If you aren't able to develop with a representative data load you'll catch your performance problems either during testing (you are doing stress/soak testing right...?), or worse, in the production environment. By that stage the cost of the changes required to the application may be prohibitive either in terms of dollars or risk. This is where the sticky tape and plasters start to make their first appearance.

Keep Your War Chest Current and Well Stocked

Know your tools and know them well! I had the pleasure of working with a Microsoft developer support engineer that had more than 14 years of dealing with the most critical development issues under his belt. Whenever I discussed a problem with him he always came up with some command or technique in the Windows debugger (WinDbg) to help further the analysis of the problem. He never said "The problem is caused by XYZ". It was always "Give this a go and see where it takes you".

This engineer had the most extensive bag of tricks I have ever seen! He had an incredible knowledge and understanding of diagnostic tools and techniques. The net result was that in all his years he never came across a support case / issue that he could not solve. Sure, some of the cases ran for weeks but he never ran out of options or techniques to progress the issue. In addition, he never allowed himself to be rushed. He took his time to perform the proper analysis of the output from his investigation and then decide on the next step.

I have a directory in My Documents called Toolbox. In there I have every tool I have ever found useful. Here are my top 5:

  • Windows Debugger / WinDbg (nothing can hide from the debugger!)
  • PsTools (pslist \\RemoteServer -s, some Anti-Viruses don't like psexec)
  • Process Explorer (best feature is being able to drill down into process / thread performance counters and take a quick peek at the stack)
  • TCPView (easiest network monitoring tool around)
  • Process Monitor (your best friend when debugging COM)

Having mastery over your tools not only makes you a more effective trouble shooter, it also serves as a learning experience that, in turn, allows you to play to the strengths of the platform.

Hopefully this article has given an idea of what we feel are important practices in the successful delivery of a SQL Server solution.

Good luck!

Labels:


Previous Posts