Fotia Ltd

Monday, 12 March 2007

When a String Is Not a String

posted by Stefan Delmarco

This article was inspired by a seemingly innocent bit of code that caused a performance problem in a very unexpected way. It is the kind of code that you'd find in many samples. Unfortunately, it has a potential flaw that is only exposed when large data volumes are present.

The Problem Code

The Data Access Layer (DAL) code in question is as follows:

public static class AccountDal
{
   public static string GetTerritoryForAccount(string accountNumber)
   {
      using(SqlConnection con = new SqlConnection(@"Server=localhost;Database=AdventureWorks;
         Integrated Security=SSPI"
))
      using(SqlCommand cmd = con.CreateCommand())
      {
         cmd.CommandText = @"
            select t.Name
            from Sales.Customer c
               inner join Sales.SalesTerritory t
                  on c.TerritoryID = t.TerritoryId
            where c.AccountNumber = @AccountNumber"
;
         cmd.CommandType = CommandType.Text;
         cmd.Parameters.AddWithValue("@AccountNumber", accountNumber);

         con.Open();
         return cmd.ExecuteScalar() as string;
      }
   }
}

Very simple stuff. Open a connection to SQL Server, select a specific row from the Sales.Customer table, join it to the Sales.SalesTerritory table and return the name of the territory. The code makes diligent use of parameters to get reusable query plans and strong typing, and to prevent SQL injection attacks. The following test exercises the code:

[TestClass]
public class AccountDalTests
{
   [TestMethod]
   public void CanGetTerritory()
   {
      string territory = AccountDal.GetTerritoryForAccount("AW00000010");

      Assert.AreEqual("Canada", territory);
   }
}

This test passes, so functionally the code works as expected. Running it on a decently spec'd developer workstation takes about 800 ms, and running it repeatedly doesn't improve on that figure. The Sales.Customer table contains 19,185 rows. Given that we're looking up a single row in that table, alarm bells should be ringing, especially as Sales.Customer has a nonclustered, unique index on AccountNumber:

exec sp_helpindex 'Sales.Customer';

index_name                 index_description                                   index_key
-------------------------- --------------------------------------------------- --------------
AK_Customer_AccountNumber  nonclustered, unique located on PRIMARY             AccountNumber
AK_Customer_rowguid        nonclustered, unique located on PRIMARY             rowguid
IX_Customer_TerritoryID    nonclustered located on PRIMARY                     TerritoryID
PK_Customer_CustomerID     clustered, unique, primary key located on PRIMARY   CustomerID

Diagnosing the Issue

Capturing the TSQL being executed in SQL Profiler and running it through Query Analyzer produces the following query plan, which shows exactly where all the time is being spent:

[Original Query Plan]

The IX_Customer_TerritoryID index is being used. However, we're performing an index scan of all the AccountNumbers instead of an index seek to the specific row we're returning. The Compute Scalar step following the index scan shows precisely why: Expr1004 is the output of an implicit conversion of every AccountNumber to NVARCHAR(10). Looking at the definition of the Sales.Customer table shows that AccountNumber is, in fact, a VARCHAR(10):

select column_name, data_type, character_maximum_length
from information_schema.columns
where table_catalog = 'AdventureWorks'
 and table_schema = 'Sales'
 and table_name = 'Customer';

column_name     data_type           character_maximum_length
--------------- ------------------- ------------------------
CustomerID      int                 NULL
TerritoryID     int                 NULL
AccountNumber   varchar             10
CustomerType    nchar               1
rowguid         uniqueidentifier    NULL
ModifiedDate    datetime            NULL

Looking at the TSQL the application sends to SQL Server, we can see that it is binding @AccountNumber as an NVARCHAR(10):

exec sp_executesql N'
   select t.Name
   from Sales.Customer c
      inner join Sales.SalesTerritory t
         on c.TerritoryID = t.TerritoryId
   where c.AccountNumber = @AccountNumber',
   N'@AccountNumber nvarchar(10)',
   @AccountNumber=N'AW00000010'

So the application is binding an NVARCHAR(10) parameter that is then compared to a VARCHAR(10) column. SQL Server therefore converts every row of the VARCHAR(10) column to NVARCHAR(10), and this is clearly where the performance problem lies. The scalability of this query is linearly dependent on the number of rows in the table: the more rows there are, the more values need to be converted. This is not a good situation to be in.

Data Type Precedence

Whenever SQL Server is asked to compare two differing data types, it uses its data type precedence rules to determine which one must be converted to perform the comparison. Assuming an implicit conversion exists, the data type that appears lower in the list is converted to the data type that appears higher in the list. In SQL Server's data type precedence order, NVARCHAR appears higher than VARCHAR, so every VARCHAR AccountNumber value must be converted to NVARCHAR before the comparison can be evaluated.
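A quick way to see the precedence rules at work is to run the comparison both ways against AdventureWorks. The variable names below are mine; with the actual execution plan switched on, the first query should show the implicit conversion and index scan, and the second a straight index seek:

-- NVARCHAR variable compared to the VARCHAR(10) AccountNumber column:
-- the column is lower in the precedence list, so every row is converted.
declare @AccountNumberN nvarchar(10);
set @AccountNumberN = N'AW00000010';

select c.CustomerID
from Sales.Customer c
where c.AccountNumber = @AccountNumberN;

-- VARCHAR variable: the data types already match, so no conversion is
-- needed and AK_Customer_AccountNumber can be used for a seek.
declare @AccountNumberV varchar(10);
set @AccountNumberV = 'AW00000010';

select c.CustomerID
from Sales.Customer c
where c.AccountNumber = @AccountNumberV;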

Unfortunately, in our case, we have ~20,000 VARCHARs to convert to NVARCHAR and just one NVARCHAR @AccountNumber! This is the reason for the index scan in the query plan. Note that with SQL Server 7.0, the query optimiser would always convert the literal and never the column. SQL Server 2000 and 2005 strictly adhere to the data type precedence rules, even if it means converting 19,185 rows. This change in behaviour is documented in KB 271566.

Looking back at the original C# code, we didn't specify a SqlDbType when adding the @AccountNumber parameter. If one isn't specified explicitly, the SqlParameter class infers the SqlDbType from the value it is given, and a System.String is always interpreted as SqlDbType.NVarChar because all .NET strings are Unicode.
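To make that inference concrete, here is the difference between letting ADO.NET guess and telling it the type explicitly. This is only a sketch (the wrapping class and method are mine); the Add overload taking a SqlDbType and a size is a standard part of SqlParameterCollection:

using System.Data;
using System.Data.SqlClient;

public static class ParameterBindingExamples
{
   public static void BindAccountNumber(SqlCommand cmd, string accountNumber)
   {
      // Inferred: a System.String is Unicode, so this binds @AccountNumber
      // as NVARCHAR and triggers the implicit conversion on the server.
      // cmd.Parameters.AddWithValue("@AccountNumber", accountNumber);

      // Explicit: the parameter now matches the VARCHAR(10) column, so the
      // optimiser can seek the AK_Customer_AccountNumber index directly.
      cmd.Parameters.Add("@AccountNumber", SqlDbType.VarChar, 10).Value = accountNumber;
   }
}

Fixing the size at 10 has a pleasant side effect as well: AddWithValue infers the length from each value it is given, which can litter the plan cache with near-identical plans that differ only in the declared parameter length.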

There are a number of possible solutions to this problem:

  • Change the C# code to have knowledge of the SqlDbTypes the parameters are binding to
  • Standardise by only allowing inferred SqlDbType types to be used in your database

Personally, I prefer the first option by a country mile. The second option prevents you from designing a data model that fully exploits the advantages of SQL Server's type system. For example, disallowing VARCHAR in favour of NVARCHAR doubles the storage required for every string column. To implement the first option, I prefer to use a data type dictionary.

Data Type Dictionary

Data dictionaries are normally only used within databases. However, introducing a data type dictionary in the DAL gives you fine-grained control over your data types as well as other facets such as length and precision. It also provides better type safety: callers of the dictionary can only pass a .NET string to get an AccountNumber SqlParameter object; any other .NET type would be a compile-time error. Lastly, should the data type ever change (e.g. accountNumber becomes a System.Int32), it would be easy to change the GetAccountNumber method and get compile-time errors wherever string account numbers are still being used.

public static class SqlDictionary
{
   public static SqlParameter GetAccountNumber(string accountNumber)
   {
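      // Explicitly bind as VARCHAR(10) to match the Sales.Customer.AccountNumber column.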
      SqlParameter parameter = new SqlParameter("@AccountNumber", SqlDbType.VarChar, 10);
      parameter.Value = accountNumber;

      return parameter;
   }
}
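Plugging the dictionary into the DAL is then a one-line change to the original method. A sketch of the revised GetTerritoryForAccount (same connection string and query as before):

using System.Data;
using System.Data.SqlClient;

public static class AccountDal
{
   public static string GetTerritoryForAccount(string accountNumber)
   {
      using(SqlConnection con = new SqlConnection(
         @"Server=localhost;Database=AdventureWorks;Integrated Security=SSPI"))
      using(SqlCommand cmd = con.CreateCommand())
      {
         cmd.CommandText = @"
            select t.Name
            from Sales.Customer c
               inner join Sales.SalesTerritory t
                  on c.TerritoryID = t.TerritoryId
            where c.AccountNumber = @AccountNumber";
         cmd.CommandType = CommandType.Text;

         // The dictionary hands back a correctly typed VARCHAR(10) parameter.
         cmd.Parameters.Add(SqlDictionary.GetAccountNumber(accountNumber));

         con.Open();
         return cmd.ExecuteScalar() as string;
      }
   }
}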

With this change, the query plan now looks far more respectable:

[Updated Query Plan]

Relying on inferred object/relational data mappings can lead to unexpected issues. This is a classic example of the object-relational impedance mismatch that so many tools try to abstract away. A simple data dictionary and a little disciplined code solve many of these problems.

Default Implicit Conversion

SQL Server 2005 has mitigated the data type precedence trap somewhat. For example, in SQL Server 2000, comparing a BIT column to a literal 1 always resulted in the BIT column being upcast to a TINYINT. Whilst this may sound a bit bizarre, the problem was that the literal 1 was always implicitly assigned the TINYINT data type, and TINYINT appears above BIT in the data type precedence rules. SQL Server 2005 applies a bit more intelligence: it assigns the literal the same data type as the column, so no conversion is required.

So, if you're still using SQL Server 2000, be sure to specify your BIT literals as 0x1. Whilst 0x1 isn't a BIT literal, its BINARY data type is lower in the data type precedence list and is therefore implicitly upcast to BIT. Using this convention you'll never fall into the trap of upcasting all your BIT columns to TINYINT. Note that if you run a SQL Server 2005 database at compatibility level 80 you'll get the old SQL Server 2000 behaviour.

-- SQL Server 2000 implicit conversion issue.
-- SQL Server 2005 OK.

select ProductID
from Production.Product
where FinishedGoodsFlag = 1;

-- SQL Server 2000 OK.
select ProductID
from Production.Product
where FinishedGoodsFlag = 0x1;

The implicit conversion is quite easy to spot in the query plan.

[Convert: SQL Server 2000 vs 2005]
