Fotia Ltd

Friday, 27 January 2006

Hash Indexes

posted by Stefan Delmarco

Hash indexes combine a number of SQL Server 2000 features into a technique for indexing wide columns.

Introduction

In a previous life I was working on a system where users were authenticated by certificates. When logging into the system the client was required to sign an XML document with their certificate and return the signed document to the application server. The middle-tier would then check the digest / signature and ensure it was valid. Once all the PKI bits were in order the application server needed to locate the user's credentials in the database.

The only way to uniquely identify a certificate (across re-issues when it expires) is by its distinguished name (DName / DN). This is the name of the entity whose public key the certificate identifies. For example, the distinguished name on my certificate might be 'CN=Stefan Delmarco, OU=Verisign, O=Fotia, S=London, C=GB'.

Indexing Wide Columns

The first problem we encountered was defining the length of the field in the database to store the DName. We had to make sure we would never truncate the DName.

Like any risk-averse developers we chose NVARCHAR(4000), as we did not want to venture into TEXT / NTEXT (which have been marked as deprecated in SQL Server 2005, hallelujah). The following T-SQL creates a sample table to store the DNames.

create table DNames (
    customerId int identity(1,1) not null primary key clustered,
    DName nvarchar(4000) not null);
go

Whilst this solved our immediate storage problem, we soon ran into a scalability challenge as this table was going to store upwards of 100,000 users. Performance was critical (is it ever not..?). Execute the following script to add 10,000 rows (this takes up about 50 MB of data).

set nocount on;
go

declare @rows int;
set @rows = 0;
declare @DName nvarchar(4000);

-- Generates about 50 MB of 'random-ish' data.
while @rows < 10000 begin
    -- Put a very pseudo Dname together...
    set @DName= N'';
    declare @loops int;
    set @loops = 1 + (rand() * 100);
    while @loops > 0 begin
        set @DName = @DName + cast(newid() as nvarchar(38));
        set @loops = @loops - 1;
    end

    insert DNames (DName)
    values (@DName);

    set @rows = @rows + 1;
end;

-- This is the DName we're going to search for...
insert DNames (DName)
values (N'303ABE1C-9318-495D-A531-295CE1D45F1F303ABE1C');
go

I used the NEWID() function to generate a GUID (e.g. 139BC677-5496-42F0-850F-6C62D446ED3C) and concatenate it between 1 and 100 times, producing a string of roughly 36 to 3,600 characters (a GUID formats to 36 characters, even though it is cast to NVARCHAR(38)). Try not to use NEWID() for generating keys in the database as it is a fairly CPU-expensive operation if you're doing a ton of inserts; rather do it in the application server if you must. I also insert a hard-coded value so that we can easily search for it later.

declare @DName nvarchar(4000);
set @DName = N'303ABE1C-9318-495D-A531-295CE1D45F1F303ABE1C';

-- Look for the DName.
select customerId
from DNames
where DName = @DName;
go

Execute the above search query and note the clustered index scan in the execution plan:

Clustered Index Scan

Hash Indexes

The next challenge we faced was trying to figure out how to make the DName lookup fast. Adding an index on the DName column was not going to work. Try it out:

create index IX_DNames_DName
on DNames(DName);
go

First it returns a warning when the index is created and then goes bang as soon as it hits the first DName that exceeds the 900-byte (450 NCHARs) limit.

Warning! The maximum key length is 900 bytes. The index 'IX_DNames_DName' has maximum length of 8000 bytes. For some combination of large values, the insert/update operation will fail.
Msg 1946, Level 16, State 3, Line 1 Operation failed. The index entry of length 2664 bytes for the index 'IX_DNames_DName' exceeds the maximum length of 900 bytes.
The statement has been terminated.

In the actual system this table was going to contain about 100,000 rows, so O(1) lookup time was critical. Without some sort of index the lookups would have been O(n): table scan city.

The solution we used is the subject of this article. The key to cracking this problem was to use a technique that combines a number of SQL Server features to produce an indexing technique that has many applications. Here we'll look at its most basic implementation.

By using a combination of the CHECKSUM function and indexed computed columns it is possible to create hash indexes. SQL Server does not support hash indexes natively (the way you'd specify a clustered or non-clustered index). They need to be constructed manually, but if done correctly they really deliver!

Hash indexes work on the same principle as a hash table. That is, reduce a large amount of data down to a number by transforming it through a hash function and index the number. The simplest example of a hash function is the modulo operator. For example, % 10 will take any integer and return a value between 0 and 9.
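To make the modulo example concrete, here is a minimal T-SQL sketch (the literals are arbitrary):

```sql
-- Modulo 10 drops any integer into one of ten buckets, 0 through 9.
select 1234 % 10 as bucket;  -- 4
select 5674 % 10 as bucket;  -- also 4: a different input, same bucket (a collision)
```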

In reverse, to find if the original data exists in the hash table, run it through the hash function and search for the resulting number using the index. Unfortunately it doesn't stop there. Hash functions are not guaranteed to be unique for all possible input values; mathematically they cannot be, since a hash maps a large domain onto a much smaller range (the pigeonhole principle). Allowances need to be made for collisions (i.e. data that is different but hashes to the same number).

Therefore, the search needs to get all instances of data that hashed to the same value and compare them, one by one, until the original data is found. The real kicker here is that the amount of data that needs to be considered should be very, very small (if you choose a good hashing function that gives a good distribution). Most of the data should have been eliminated by the hash function.

For a hash function we use SQL Server's CHECKSUM function. The algorithm it implements is not documented but I have a strong suspicion it is CRC-32. The CHECKSUM function returns an integer (32-bit).
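A quick way to get a feel for CHECKSUM's output (the input strings here are made up):

```sql
-- CHECKSUM reduces an arbitrary expression to a 32-bit integer.
select checksum(N'CN=Stefan Delmarco, O=Fotia') as hashValue;

-- Beware: CHECKSUM honours the collation, so under the default
-- case-insensitive collation these two return the same hash.
select checksum(N'fotia') as lowerHash, checksum(N'FOTIA') as upperHash;
```

The collation behaviour is usually what you want for a case-insensitive column, but it does mean the hash can never be more discriminating than the equality rules of the underlying column.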

-- Add a computed column.
alter table DNames
add DNameHash as checksum(DName);
go

-- Index it. Note that this is a non-unique index to allow for collisions.
create index IX_DNames_DNameHash
on DNames(DNameHash);
go
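With the hash column and index in place it is worth checking how well CHECKSUM is distributing the data. A purely diagnostic query, assuming the DNames table built above:

```sql
-- Any hash value shared by more than one DName is a collision.
-- With 10,001 rows hashed into a 32-bit space there should be few, if any.
select DNameHash, count(*) as dnamesWithThisHash
from DNames
group by DNameHash
having count(*) > 1;
go
```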

Now, to search for the original DName we inserted into the table, execute the following script:

declare @DName nvarchar(4000);
set @DName = N'303ABE1C-9318-495D-A531-295CE1D45F1F303ABE1C';

-- Look for the hash of the DName.
-- Always, always include the original data in case of collisions.
select customerId
from DNames
where DNameHash = checksum(@DName)
    and DName = @DName;
go

Bring up the execution plan and note the index seek on the IX_DNames_DNameHash index and sub-second response.

Index Seek

SQL Server has recognised that the hash index will eliminate the greatest number of rows, so it executes the seek on the hash first. Only then does it perform the expensive DName comparison. At that stage SQL Server should only be comparing a very, very small number of rows, normally just one if the DName exists.
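If you want to confirm the difference without the graphical plan, STATISTICS IO makes it plain (the read counts will vary with your data):

```sql
set statistics io on;
go

declare @DName nvarchar(4000);
set @DName = N'303ABE1C-9318-495D-A531-295CE1D45F1F303ABE1C';

-- With the hash index this is an index seek plus a bookmark lookup;
-- compare the handful of logical reads against the thousands
-- reported for the earlier clustered index scan.
select customerId
from DNames
where DNameHash = checksum(@DName)
    and DName = @DName;
go

set statistics io off;
go
```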

Even better, this technique of creating hash indexes is already available in SQL Server 2000! So those who don't have SQL Server 2005 available right now can use hash indexes today. There are some more advanced applications of hash indexes (using CHECKSUM_AGG) that we'll take a look at in a future feature article.
