Fotia Ltd

Fotia Ltd

Practical Uses of Partitioned Views

Traditional Partitioning

Almost all of the examples of partitioned views out there apply the technology to either partitioning tables by a date range (i.e. per-year or per-quarter) or by the name of an entity (e.g. customer name). These examples aren't very interesting and I've rarely used a partitioned view in this manner. If you plan on using partitioned views in this manner then there are plenty of "Hello World" examples to help you. Let's have a look at some unorthodox implementation of partitioned views.

Large Skewed Table

This technique applies the partitioned view technology to a somewhat removed problem where a little lateral thinking goes a long way! The problem is this: you have a table that has grown in size to a point where managing such a large amount of data as a single entity starts to become a problem. Time taken during an upgrade to add / remove / change columns is measured in hours instead of minutes. In addition, bad query plans start to show up due to the table's skewed cardinality in its indexed columns. Skewed indexed columns will return a big variation in the number of rows for each unique value / combination of values in the index.

For example, let's say that the table contains orders. The clustered key consists of order number and order line item number. Some orders contain hundreds of thousands of order line items whereas other orders contain just one or two. Therefore the optimal execution plan is heavily dependant on the actual order number used in a query. If SQL Server's statistics have sampled the order number being used then SQL Server will generate a good plan. If SQL Server has to estimate the number of rows returned (as its statistics do not include that order number) then you sometimes get a good plan and sometimes get a bad plan depending on how good the guess was (if the plan was being recompiled otherwise a cached plan will be used). The goal then becomes to keep the statistics up-to-date. SQL Server does a pretty good job with keeping statistics up to date when the cardinality of the indexed columns is uniform. SQL Server can get away with only sampling a small subset of the rows and extrapolating the results. However, with our large skewed table, SQL Server needs to sample every row in order to make the '100,000 vs. 1' decision.

So next you decide to update the statistics every evening using the WITH FULLSCAN option to make sure that the statistics include every single row. However, as data gets added to the table during the day, the optimizer is back playing guessing games with this new data. Sometimes it works out, sometimes an irate user calls you telling you that the database is slow. Setting the statistics to auto-update is not the answer either as the threshold to cause an automatic update of the statistics is so large that it is highly unlikely to be reached during the day (see How SQL Server 7.0 and 2000 AutoStats Work). Even worse, if automatic update of statistics had to kick in during the day then the nicely 100% sampled statistics would be blown away by new statistics that only sampled a fraction of the rows.

Looks pretty dire, doesn't it? Well, enter stage left partitioned views. What we'd like is one table that starts out empty at the start of every day. This is the volatile 'lean-and-mean' table that gets populated throughout the day. Call this the current table. The current table is sufficiently small to allow auto stats to sample a good number of the rows. In addition, as the table is growing in huge increments (when compared to its total size) the statistics will always be fresh. On the other hand we have a table that is the static 'big-and-heavy' table that contains all of the historic data, up to and including yesterday's data. Call this the archive table.

At the end every day, during the batch window, all of the data is transferred from the current table to the archive table. Once the transfer is complete, the statistics on the archive table are updated using the WITH FULLSCAN option. The archive table might be updated during the day but the indexed columns are never touched (you know that it's a sin to update any columns in the clustered key - right?). Therefore, during the day, the current table grows with statistics being continually updated and the archive table keeps its 100% sampled statistics. Whilst you could get away with updating your code to be aware of both tables, this would complicate the code needlessly. Adding a partitioned view over these two tables is just the right fit as it allows you treat the current and archive tables as a single entity.

The partitioned view over the current and archive table is updatable so your code is able to manipulate the partitioned view as if it was just a single large table. However, the one caveat is that this strategy does not lend itself very well to choosing a partitioning key. Essentially we want to partition 'manually' every evening as we move data from current to archive. We could get away with using just a normal view, but that would not make the view updatable. It has to adhere to the rules that govern partition views. One of these rules is that the partitioning column forms part of the primary key. A neat trick I've used on a number of occasions is to use an identity column as the partitioning key. According to the partitioned view rules this is not allowed. BOL states that "The partitioning column cannot be an identity, default or timestamp column". However, with some creative thinking and some bending of the rules we can get this to work like a charm:

  • Add an Id column to both current and archive tables.
  • Seed current's identity at 1 and get it to increment.
  • Seed archive's identity at -1 and get it to decrement.
  • Add a CHECK constraint to both columns asserting that current's Id will always be positive and archive's Id will be negative.
  • Set the Id column to be the primary key on tables. Make sure you leave the index on the 'real' primary key.

Using this technique we have bypassed this rule. However, it does come at a price, albeit a small one. The partitioned view cannot be inserted into directly. This makes sense if you think about it. How would SQL Server know which underlying table to insert into anyway as the values in the identity columns are generated after the row is inserted into the underlying table? There are two ways to address this. The first option is to update your code that inserts into the partitioned view and get it to insert into the current table instead. The second option is to put an INSTEAD OF trigger on the view that inserts into the current table. The first option gives better performance but means you might have to touch loads of code. The second option is slightly slower but means you don't have to update any code. With the second option you'll need to specify a value for the partitioning column even though it will not be used. It is only required for the SQL query processor to be satisfied that all the non-nullable columns have been specified. The INSTEAD OF trigger will just ignore this value.

Besides this small caveat, the partitioned view is fully updateable. You can update / delete data in the partitioned view to your heart's content!

This TSQL script demonstrates the technique with a worked migration example. The following script represents the table as it was before being split:

-- Our big large table...
create table BigLargeTable (
    orderId int not null,
    orderLineItemId int not null,
    someData varchar(100))
go

-- The current primary key...
alter table BigLargeTable
add constraint PK_BigLargeTable
primary key clustered (orderId, orderLineItemId)
go

The following script splits the table into current and archive using the technique described above.

-- Rename the BigLargeTable to Archive.
exec sp_rename 'BigLargeTable', 'BigLargeTable_Archive', 'object';
go
-- Drop the existing PK and replace it with a unique clustered index.
alter table BigLargeTable_Archive
    drop constraint PK_BigLargeTable;
go
create unique clustered index IX_BigLargeTable_Archive
    on BigLargeTable_Archive (orderId, orderLineItemId);
go

-- Add the partitioning column...
alter table BigLargeTable_Archive
    add partitionId int not null identity (1, 1)
    check (partitionId > 1);
go

-- Add the new primary key...
alter table BigLargeTable_Archive
    add constraint PK_BigLargeTable_Archive
    primary key nonclustered (partitionId);
go

-- Create the Current table...
create table BigLargeTable_Current (
    orderId int not null,
    orderLineItemId int not null,
    someData varchar(100),
    partitionId int not null identity (-1, -1));
go
create unique clustered index IX_BigLargeTable_Current
    on BigLargeTable_Current (orderId, orderLineItemId);
go
-- Add the new primary key...
alter table BigLargeTable_Current
    add constraint PK_BigLargeTable_Current
    primary key nonclustered (partitionId);
go

-- Create the partitioned view, keep the name the same as the table we are splitting...
create view BigLargeTable with schemabinding
as
    select orderId, orderLineItemId, someData, partitionId
    from dbo.BigLargeTable_Current
    union all
    select orderId, orderLineItemId, someData, partitionId
    from dbo.BigLargeTable_Archive;
go
-- Instead of trigger for option 2...
create trigger BigLargeTable_insert
    on dbo.BigLargeTable
    instead of insert
as
    insert BigLargeTable_Current (
        orderId, orderLineItemId, someData)
    select orderId, orderLineItemId, someData
    from inserted;
go

<<< Previous  Next >>>