Practical Uses of Partitioned Views
Traditional Partitioning
Almost all of the examples of partitioned views out there apply the technology to
either partitioning tables by a date range (i.e. per-year or per-quarter) or by
the name of an entity (e.g. customer name). These examples aren't very interesting
and I've rarely used a partitioned view in this manner. If you plan on using partitioned
views in this manner then there are plenty of "Hello World" examples to
help you. Let's have a look at some unorthodox implementation of partitioned views.
Large Skewed Table
This technique applies the partitioned view technology to a somewhat removed problem
where a little lateral thinking goes a long way! The problem is this: you have a
table that has grown in size to a point where managing such a large amount of data
as a single entity starts to become a problem. Time taken during an upgrade to add
/ remove / change columns is measured in hours instead of minutes. In addition,
bad query plans start to show up due to the table's skewed cardinality in its indexed
columns. Skewed indexed columns will return a big variation in the number of rows
for each unique value / combination of values in the index.
For example, let's say that the table contains orders. The clustered key consists
of order number and order line item number. Some orders contain hundreds of thousands
of order line items whereas other orders contain just one or two. Therefore the
optimal execution plan is heavily dependant on the actual order number used in a
query. If SQL Server's statistics have sampled the order number being used then
SQL Server will generate a good plan. If SQL Server has to estimate the number of
rows returned (as its statistics do not include that order number) then you sometimes
get a good plan and sometimes get a bad plan depending on how good the guess was
(if the plan was being recompiled otherwise a cached plan will be used). The goal
then becomes to keep the statistics up-to-date. SQL Server does a pretty good job
with keeping statistics up to date when the cardinality of the indexed columns is
uniform. SQL Server can get away with only sampling a small subset of the rows and
extrapolating the results. However, with our large skewed table, SQL Server needs
to sample every row in order to make the '100,000 vs. 1' decision.
So next you decide to update the statistics every evening using the WITH FULLSCAN
option to make sure that the statistics include every single row. However, as data
gets added to the table during the day, the optimizer is back playing guessing games
with this new data. Sometimes it works out, sometimes an irate user calls you telling
you that the database is slow. Setting the statistics to auto-update is not the
answer either as the threshold to cause an automatic update of the statistics is
so large that it is highly unlikely to be reached during the day (see
How SQL Server 7.0 and 2000 AutoStats Work). Even worse, if automatic update
of statistics had to kick in during the day then the nicely 100% sampled statistics
would be blown away by new statistics that only sampled a fraction of the rows.
Looks pretty dire, doesn't it? Well, enter stage left partitioned views. What we'd
like is one table that starts out empty at the start of every day. This is the volatile
'lean-and-mean' table that gets populated throughout the day. Call this the current
table. The current table is sufficiently small to allow auto stats to sample a good
number of the rows. In addition, as the table is growing in huge increments (when
compared to its total size) the statistics will always be fresh. On the other hand
we have a table that is the static 'big-and-heavy' table that contains all of the
historic data, up to and including yesterday's data. Call this the archive
table.
At the end every day, during the batch window, all of the data is transferred from
the current table to the archive table. Once the transfer is complete, the statistics
on the archive table are updated using the WITH FULLSCAN option. The archive table
might be updated during the day but the indexed columns are never touched (you know
that it's a sin to update any columns in the clustered key - right?). Therefore,
during the day, the current table grows with statistics being continually updated
and the archive table keeps its 100% sampled statistics. Whilst you could get away
with updating your code to be aware of both tables, this would complicate the code
needlessly. Adding a partitioned view over these two tables is just the right fit
as it allows you treat the current and archive tables as a single entity.
The partitioned view over the current and archive table is updatable so your code
is able to manipulate the partitioned view as if it was just a single large table.
However, the one caveat is that this strategy does not lend itself very well to
choosing a partitioning key. Essentially we want to partition 'manually' every evening
as we move data from current to archive. We could get away with using just a normal
view, but that would not make the view updatable. It
has to adhere to the rules that govern partition views. One of these rules
is that the partitioning column forms part of the primary key. A neat trick I've
used on a number of occasions is to use an identity column as the partitioning key.
According to the partitioned view rules this is not allowed. BOL states that "The
partitioning column cannot be an identity, default or timestamp column".
However, with some creative thinking and some bending of the rules we can get this
to work like a charm:
- Add an Id column to both current and archive tables.
- Seed current's identity at 1 and get it to increment.
- Seed archive's identity at -1 and get it to decrement.
- Add a CHECK constraint to both columns asserting that current's Id will always
be positive and archive's Id will be negative.
- Set the Id column to be the primary key on tables. Make sure you leave the index
on the 'real' primary key.
Using this technique we have bypassed this rule. However, it does come at a price,
albeit a small one. The partitioned view cannot be inserted into directly. This
makes sense if you think about it. How would SQL Server know which underlying table
to insert into anyway as the values in the identity columns are generated
after the row is inserted into the underlying table? There are two ways
to address this. The first option is to update your code that inserts into the partitioned
view and get it to insert into the current table instead. The second option is to
put an INSTEAD OF trigger on the view that inserts into the current table. The first
option gives better performance but means you might have to touch loads of code.
The second option is slightly slower but means you don't have to update any code.
With the second option you'll need to specify a value for the partitioning column
even though it will not be used. It is only required for the SQL query processor
to be satisfied that all the non-nullable columns have been specified. The INSTEAD
OF trigger will just ignore this value.
Besides this small caveat, the partitioned view is fully updateable. You can update
/ delete data in the partitioned view to your heart's content!
This TSQL script demonstrates the technique with a worked migration example. The
following script represents the table as it was before being split:
-- Our big large table...
create table BigLargeTable (
orderId int
not null,
orderLineItemId int not null,
someData varchar(100))
go
-- The current primary key...
alter table BigLargeTable
add constraint PK_BigLargeTable
primary key clustered (orderId, orderLineItemId)
go
The following script splits the table into current and archive using the technique
described above.
-- Rename the BigLargeTable to Archive.
exec sp_rename
'BigLargeTable',
'BigLargeTable_Archive', 'object';
go
-- Drop the existing PK and replace it with a unique
clustered index.
alter table BigLargeTable_Archive
drop constraint PK_BigLargeTable;
go
create unique clustered index IX_BigLargeTable_Archive
on BigLargeTable_Archive
(orderId, orderLineItemId);
go
-- Add the partitioning column...
alter table BigLargeTable_Archive
add partitionId int not null identity
(1, 1)
check (partitionId >
1);
go
-- Add the new primary key...
alter table BigLargeTable_Archive
add constraint PK_BigLargeTable_Archive
primary key nonclustered
(partitionId);
go
-- Create the Current table...
create table BigLargeTable_Current (
orderId int
not null,
orderLineItemId int not null,
someData varchar(100),
partitionId int not null
identity (-1, -1));
go
create unique clustered index IX_BigLargeTable_Current
on BigLargeTable_Current
(orderId, orderLineItemId);
go
-- Add the new primary key...
alter table BigLargeTable_Current
add constraint PK_BigLargeTable_Current
primary key nonclustered
(partitionId);
go
-- Create the partitioned view, keep the name the same
as the table we are splitting...
create view BigLargeTable
with schemabinding
as
select orderId, orderLineItemId,
someData, partitionId
from dbo.BigLargeTable_Current
union
all
select orderId, orderLineItemId,
someData, partitionId
from dbo.BigLargeTable_Archive;
go
-- Instead of trigger for option 2...
create trigger BigLargeTable_insert
on dbo.BigLargeTable
instead of insert
as
insert BigLargeTable_Current
(
orderId, orderLineItemId, someData)
select orderId, orderLineItemId,
someData
from inserted;
go
<<< Previous Next >>>