Monday, 12 March 2007
When a String Is Not a String
posted by Stefan Delmarco
This article was inspired by a seemingly innocent bit of code that caused a performance problem in a very unexpected way. It is the kind of code that you'd find in many samples. Unfortunately, it has a potential flaw that is only exposed when large data volumes are present.
The Problem Code
The Data Access Layer (DAL) code in question is as follows:
{
public static string GetTerritoryForAccount(string accountNumber)
{
using(SqlConnection con = new SqlConnection(@"Server=localhost;Database=AdventureWorks;
Integrated Security=SSPI"))
using(SqlCommand cmd = con.CreateCommand())
{
cmd.CommandText = @"
select t.Name
from Sales.Customer c
inner join Sales.SalesTerritory t
on c.TerritoryID = t.TerritoryId
where c.AccountNumber = @AccountNumber";
cmd.CommandType = CommandType.Text;
cmd.Parameters.AddWithValue("@AccountNumber", accountNumber);
con.Open();
return cmd.ExecuteScalar() as string;
}
}
}
Very simple stuff. Open a connection to SQL, select a specific row from the Sales.Customer table, join the resultant row to the Sales.SalesTerritory table and return the name of the territory. The code is making diligent use of parameters for reusable query plans, strong typing and to prevent SQL injection attacks. The following test exercises the code:
public class AccountDalTests
{
[TestMethod]
public void CanGetTerritory()
{
string territory = AccountDal.GetTerritoryForAccount("AW00000010");
Assert.AreEqual(territory, "Canada");
}
}
This test passes, so functionally the code works as expected. Running this bit of code on a decently spec'd developer workstation takes about 800 ms. Running it repeatedly doesn't improve on this figure. The Sales.Customer table contains 19,185 rows. Given that we're looking up a single row in the Sales.Customer row the alarm bells should be ringing, especially as the Sales.Customer has a nonclustered, unique index on AccountNumber:
| index_name | index_description | index_key |
| -------------------------- | ----------------------------------------------------- | -------------------- |
| AK_Customer_AccountNumber | nonclustered, unique located on PRIMARY | AccountNumber |
| AK_Customer_rowguid | nonclustered, unique located on PRIMARY | rowguid |
| IX_Customer_TerritoryID | nonclustered located on PRIMARY | TerritoryID |
| PK_Customer_CustomerID | clustered, unique, primary key located on PRIMARY | CustomerID |
Diagnosing the Issue
Capturing the TSQL being executed in SQL Profiler and running it through Query Analyzer shows the following query plan. The query plan shows exactly where all the time is being spent:
The IX_Customer_TerritoryID index is being used. However, we're performing an index scan of all the AccountNumbers instead of just an index seek to the specific row we're returning. The compute scalar step following the index scan shows precisely why. Expr1004 is the output of an implicit conversion of every AccountNumber to NVARCHAR(10). Looking at the definition of the Sales.Customer table shows that AccountNumber is, in fact, a VARCHAR(10):
from information_schema.columns
where table_catalog = 'AdventureWorks'
and table_schema = 'Sales'
and table_name = 'Customer';
| column_name | data_type | character_maximum_length |
| -------------- | ------------------- | ------------------------ |
| CustomerID | int | NULL |
| TerritoryID | int | NULL |
| AccountNumber | varchar | 10 |
| CustomerType | nchar | 1 |
| rowguid | uniqueidentifier | NULL |
| ModifiedDate | datetime | NULL |
Looking at the TSQL the application sends to SQL Server, we can see that it is binding @AccountNumber as an NVARCHAR(10):
select t.Name
from Sales.Customer c
inner join Sales.SalesTerritory t
on c.TerritoryID = t.TerritoryId
where c.AccountNumber = @AccountNumber',
N'@AccountNumber nvarchar(10)',
@AccountNumber=N'AW00000010'
So the application is binding an NVARCHAR(10) parameter that is then being compared to a VARCHAR(10) column. SQL Server then converts every row of the VARCHAR(10) column to an NVARCHAR(10) data type. Clearly this is where the performance problem is. The scalability of this query is therefore linearly dependant on the number of rows in the table. The more rows there are, the more values need to be converted. This not a good situation to be in.
Data Type Precedence
Whenever SQL Server is asked to compare two differing data types, it uses its data type precedence rules to determine which data type needs to be converted to perform the comparison. Assuming an implicit conversion exists, the data type that appears lower in the list is converted to the data type that appears higher in the list. SQL Server's Data Type Precedence order has NVARCHAR appearing higher than VARCHAR. Therefore, all VARCHAR data types need to be converted to NVARCHAR in order to evaluate the AccountNumber values.
Unfortunately, in our case, we have ~20,000 VARCHARs to convert to NVARCHAR and 1 NVARCHAR @AccountNumber! This is the reason for the index scan in the query plan. Note that with SQL Server 7.0, the query optimiser would always convert the literal and never the column. SQL Server 2000/5 strictly adheres to the Data Type Precedence rules, even if it means having to convert 19,185 rows. This change in behaviour is documented in KB 271566.
Looking back at the original C# code, we didn't specify a SqlDbType when adding the @AccountNumber parameter. The SqlParameter class will infer the SqlDbType from the object value specified if it is not specified explicitly. A System.String is always interpreted as SqlDbType.NVarChar as all .NET strings are Unicode.
There are a number of possible solutions to this problem:
- Change the C# code to have knowledge of the SqlDbTypes the parameters are binding to
- Standardise by only allowing inferred SqlDbType types to be used in your database
Personally, I prefer the first option by a country mile. The second option prevents you from designing a data model that fully exploits all of the advantages of the SQL Server domain. For example, disallowing VARCHARs will double the amount of storage required to store your VARCHARs. I prefer to use a data type dictionary.
Data Type Dictionary
Data dictionaries are normally only used within databases. However, introducing a data type dictionary in the DAL allows you to have fine-grained control over your data types as well as other data type facets such as length and precision. In addition, it provides for better type safety as users of the data dictionary can only pass a .NET string to get an AccountNumber SqlParameter object. Any other .NET would obviously be a compile time error. Lastly, should you ever have to change data types (e.g. accountNumber becomes a System.Int32) it would be easy to make the change to the GetAccountNumber method and get compile time errors where string accountNumbers are still being used.
{
public static SqlParameter GetAccountNumber(string accountNumber)
{
SqlParameter parameter = new SqlParameter("@AccountNumber", SqlDbType.VarChar, 10);
parameter.Value = accountNumber;
return parameter;
}
}
With this change, the query plan now looks far more respectable:
Relying on inferred Object / Relational data mappings can lead to unexpected issues. This is a classic example of the ORM-impedance mismatch that so many tools try to abstract away. However, by utilising a simple data dictionary, many of these problems can be easily solved by writing disciplined code.
Default Implicit Conversion
SQL Server 2005 has mediated the data type precedence trap somewhat. For example, in SQL Server 2000, comparing a BIT column to a literal 1 always resulted in the BIT column being upcast to a TINYINT. Whilst this may sound a bit bizarre, the problem was that the literal number 1 was always implicitly assigned the TINYINT data type and TINYINT appears above BIT in the data type precedence rules. SQL Server 2005 now applies a bit more intelligence. It assigns the literal the same data type as the column so no conversion is required.
So, if you're still using SQL Server 2000, be sure to specify your BIT literals as 0x1. Whilst this doesn't equate to a BIT data type, its BINARY datatype is lower on the data type precedence rules and is therefore implicitly upcast to a BIT data type. Using this standard you'll never fall into the trap of upcasting all your BITS to TINYINTs. Note that if you run a SQL Server 2005 database in 80 compatability level you'll get the old SQL Server 2000 behaviour.
-- SQL Server 2005 OK.
select ProductID
from Production.Product
where FinishedGoodsFlag = 1;
-- SQL Server 2000 OK.
select ProductID
from Production.Product
where FinishedGoodsFlag = 0x1;
The implicit conversion is quite easy to spot in the query plan.
