Fotia Ltd

Fotia Ltd

Monday, 12 March 2007

When a String Is Not a String

posted by Stefan Delmarco

This article was inspired by a seemingly innocent bit of code that caused a performance problem in a very unexpected way. It is the kind of code that you'd find in many samples. Unfortunately, it has a potential flaw that is only exposed when large data volumes are present.

The Problem Code

The Data Access Layer (DAL) code in question is as follows:

public static class AccountDal
{
   public static string GetTerritoryForAccount(string accountNumber)
   {
      using(SqlConnection con = new SqlConnection(@"Server=localhost;Database=AdventureWorks;
         Integrated Security=SSPI"
))
      using(SqlCommand cmd = con.CreateCommand())
      {
         cmd.CommandText = @"
            select t.Name
            from Sales.Customer c
               inner join Sales.SalesTerritory t
                  on c.TerritoryID = t.TerritoryId
            where c.AccountNumber = @AccountNumber"
;
         cmd.CommandType = CommandType.Text;
         cmd.Parameters.AddWithValue("@AccountNumber", accountNumber);

         con.Open();
         return cmd.ExecuteScalar() as string;
      }
   }
}

Very simple stuff. Open a connection to SQL, select a specific row from the Sales.Customer table, join the resultant row to the Sales.SalesTerritory table and return the name of the territory. The code is making diligent use of parameters for reusable query plans, strong typing and to prevent SQL injection attacks. The following test exercises the code:

[TestClass]
public class AccountDalTests
{
   [TestMethod]
   public void CanGetTerritory()
   {
      string territory = AccountDal.GetTerritoryForAccount("AW00000010");

      Assert.AreEqual(territory, "Canada");
   }
}

This test passes, so functionally the code works as expected. Running this bit of code on a decently spec'd developer workstation takes about 800 ms. Running it repeatedly doesn't improve on this figure. The Sales.Customer table contains 19,185 rows. Given that we're looking up a single row in the Sales.Customer row the alarm bells should be ringing, especially as the Sales.Customer has a nonclustered, unique index on AccountNumber:

exec sp_helpindex 'Sales.Customer';

index_name index_description index_key
-------------------------- ----------------------------------------------------- --------------------
AK_Customer_AccountNumber nonclustered, unique located on PRIMARY AccountNumber
AK_Customer_rowguid nonclustered, unique located on PRIMARY rowguid
IX_Customer_TerritoryID nonclustered located on PRIMARY TerritoryID
PK_Customer_CustomerID clustered, unique, primary key located on PRIMARY CustomerID

Diagnosing the Issue

Capturing the TSQL being executed in SQL Profiler and running it through Query Analyzer shows the following query plan. The query plan shows exactly where all the time is being spent: Original Query Plan The IX_Customer_TerritoryID index is being used. However, we're performing an index scan of all the AccountNumbers instead of just an index seek to the specific row we're returning. The compute scalar step following the index scan shows precisely why. Expr1004 is the output of an implicit conversion of every AccountNumber to NVARCHAR(10). Looking at the definition of the Sales.Customer table shows that AccountNumber is, in fact, a VARCHAR(10):

select column_name, data_type, character_maximum_length
from information_schema.columns
where table_catalog = 'AdventureWorks'
 and table_schema = 'Sales'
 and table_name = 'Customer';

column_name data_type character_maximum_length
-------------- ------------------- ------------------------
CustomerID int NULL
TerritoryID int NULL
AccountNumber varchar 10
CustomerType nchar 1
rowguid uniqueidentifier NULL
ModifiedDate datetime NULL

Looking at the TSQL the application sends to SQL Server, we can see that it is binding @AccountNumber as an NVARCHAR(10):

exec sp_executesql N'
   select t.Name
   from Sales.Customer c
      inner join Sales.SalesTerritory t
         on c.TerritoryID = t.TerritoryId
   where c.AccountNumber = @AccountNumber'
,
   N'@AccountNumber nvarchar(10)',
   @AccountNumber=N'AW00000010'

So the application is binding an NVARCHAR(10) parameter that is then being compared to a VARCHAR(10) column. SQL Server then converts every row of the VARCHAR(10) column to an NVARCHAR(10) data type. Clearly this is where the performance problem is. The scalability of this query is therefore linearly dependant on the number of rows in the table. The more rows there are, the more values need to be converted. This not a good situation to be in.

Data Type Precedence

Whenever SQL Server is asked to compare two differing data types, it uses its data type precedence rules to determine which data type needs to be converted to perform the comparison. Assuming an implicit conversion exists, the data type that appears lower in the list is converted to the data type that appears higher in the list. SQL Server's Data Type Precedence order has NVARCHAR appearing higher than VARCHAR. Therefore, all VARCHAR data types need to be converted to NVARCHAR in order to evaluate the AccountNumber values.

Unfortunately, in our case, we have ~20,000 VARCHARs to convert to NVARCHAR and 1 NVARCHAR @AccountNumber! This is the reason for the index scan in the query plan. Note that with SQL Server 7.0, the query optimiser would always convert the literal and never the column. SQL Server 2000/5 strictly adheres to the Data Type Precedence rules, even if it means having to convert 19,185 rows. This change in behaviour is documented in KB 271566.

Looking back at the original C# code, we didn't specify a SqlDbType when adding the @AccountNumber parameter. The SqlParameter class will infer the SqlDbType from the object value specified if it is not specified explicitly. A System.String is always interpreted as SqlDbType.NVarChar as all .NET strings are Unicode.

There are a number of possible solutions to this problem:

  • Change the C# code to have knowledge of the SqlDbTypes the parameters are binding to
  • Standardise by only allowing inferred SqlDbType types to be used in your database

Personally, I prefer the first option by a country mile. The second option prevents you from designing a data model that fully exploits all of the advantages of the SQL Server domain. For example, disallowing VARCHARs will double the amount of storage required to store your VARCHARs. I prefer to use a data type dictionary.

Data Type Dictionary

Data dictionaries are normally only used within databases. However, introducing a data type dictionary in the DAL allows you to have fine-grained control over your data types as well as other data type facets such as length and precision. In addition, it provides for better type safety as users of the data dictionary can only pass a .NET string to get an AccountNumber SqlParameter object. Any other .NET would obviously be a compile time error. Lastly, should you ever have to change data types (e.g. accountNumber becomes a System.Int32) it would be easy to make the change to the GetAccountNumber method and get compile time errors where string accountNumbers are still being used.

public static class SqlDictionary
{
   public static SqlParameter GetAccountNumber(string accountNumber)
   {
      SqlParameter parameter = new SqlParameter("@AccountNumber", SqlDbType.VarChar, 10);
      parameter.Value = accountNumber;

      return parameter;
   }
}

With this change, the query plan now looks far more respectable: Updated Query Plan

Relying on inferred Object / Relational data mappings can lead to unexpected issues. This is a classic example of the ORM-impedance mismatch that so many tools try to abstract away. However, by utilising a simple data dictionary, many of these problems can be easily solved by writing disciplined code.

Default Implicit Conversion

SQL Server 2005 has mediated the data type precedence trap somewhat. For example, in SQL Server 2000, comparing a BIT column to a literal 1 always resulted in the BIT column being upcast to a TINYINT. Whilst this may sound a bit bizarre, the problem was that the literal number 1 was always implicitly assigned the TINYINT data type and TINYINT appears above BIT in the data type precedence rules. SQL Server 2005 now applies a bit more intelligence. It assigns the literal the same data type as the column so no conversion is required.

So, if you're still using SQL Server 2000, be sure to specify your BIT literals as 0x1. Whilst this doesn't equate to a BIT data type, its BINARY datatype is lower on the data type precedence rules and is therefore implicitly upcast to a BIT data type. Using this standard you'll never fall into the trap of upcasting all your BITS to TINYINTs. Note that if you run a SQL Server 2005 database in 80 compatability level you'll get the old SQL Server 2000 behaviour.

-- SQL Server 2000 implicit conversion issue.
-- SQL Server 2005 OK.

select ProductID
from Production.Product
where FinishedGoodsFlag = 1;

-- SQL Server 2000 OK.
select ProductID
from Production.Product
where FinishedGoodsFlag = 0x1;

The implicit conversion is quite easy to spot in the query plan.

Convert SQL 200 vs 2005

Labels: ,

Thursday, 18 January 2007

Covariant Generic List

posted by Stefan Delmarco

The introduction of generics in .NET 2.0 has opened up a whole new approach to class design in C#. The first, and most obvious, benefit generics provide is strongly typed containers. We no longer have to keep casting, boxing / unboxing, to / from object for ArrayList items anymore.

However, with the power of generics come a few limitations. This article explores some subtle restrictions in covariance with generics.

What is Covariance?

In mathematical terms covariance is the measure of the tendency of two things to move or vary together. In the world of object-oriented programming, covariance describes the situation in which a derived type is used where a base class was expected. For example, in .NET arrays are covariant. That is:

DerivedClass[] derivedClasses = new DerivedClass[3];
BaseClass[] baseClasses = derivedClasses;

As a worked example, let's say that we have a People class. An instance of the People class is created from a data source that requires read / write access to the class's properties. A good example of such a data source is XML serialisation. The XmlSerializer class requires read / write properties and a default constructor in order to create an object from XML. However, once this object has been created, I want the object to be immutable. I don't want any properties of the object to be changed once it has been created. The natural implementation for this requirement is to create the IPerson interface that only exposes the property getters:

// Immutable Person.
public interface IPerson
{
   string Name { get; }
}

public class Person : IPerson
{
   private string name;

   public Person()
   {}

   public string Name
   {
      get { return this.name; }
      set { this.name = value; }
   }
}

public static class PersonFactory
{
   // Factory method to create an IPerson instance.
   public static IPerson CreatePerson()
   {
      // Simulate a data source.
      Person person = new Person();
      person.Name = "Trogdor";

      return person;
   }
}

Generic Covariance Limitations

Now, let's say that we aren't creating single Person objects. Instead the data source returns an IList of Person objects. We would naturally try the following code first:

public static class PeopleFactory
{
   public static IList<IPerson> CreatePeople()
   {
      // Simulate a data source.
      List<Person> people = new List<Person>();
      // Person objects added to list.
      // ...

      return people;
   }
}

The code above does not compile. The C# compiler fails with error:

CS0030: Cannot convert type 'System.Collections.Generic.List<TestScratch.Person>' to 'System.Collections.Generic.List<TestScratch.IPerson>'

The problem we've run into here has to do with the generic type compatibility, specifically, generics are not covariant.

This is nicely summarised over on MSDN: Are Generics Covariant, Contra-Variant, or Invariant?. In a nutshell, generics cannot be covariant as it would allow illegal constructs. For example:

public class Adult : IPerson
{
   //
}

public class Child : IPerson
{
   //
}

public void WhyCoVarianceIsNotAllowed()
{
   List<Adult> adults = new List<Adult>();
   adults.Add(new Adult());
   // If the following line would be allowed...
   List<IPerson> people = adults;
   // This line would have to be legal, i.e. adding a Child to an Adult List.
   people.Add(new Child());
}

A Generic Covariant List Implementation

As dire as this seems, not all is lost. A natural extension to the immutable object requirement would be to constrain the list of immutable objects to be itself immutable. This additional restriction allows us to circumvent the generic covariance issue as the immutable list would not allow the 'illegal' situation to arise. The most natural interface to return for an immutable list is IEnumerable<T>. We therefore need to find a way of converting / casting IList<Person> to IEnumerable<IPerson>. As the C# compiler does not make any generic covariance exceptions, it will still not allow us to directly cast IList<Person> to IEnumerable<IPerson>. Instead we need to find a way to express the readonly nature of IEnumerable in a way that gets the C# compiler to accept that what we're trying to do is legal. That is what the following class accomplishes:

public class EnumerableGeneric<TClass, TInterface> : IEnumerable<TInterface>
   where TClass : TInterface
{
   private IList<TClass> list;

   public EnumerableGeneric(IList<TClass> list)
   {
      this.list = list;
   }

   public IEnumerator<TInterface> GetEnumerator()
   {
      foreach(TClass item in list)
         yield return item;
   }

   IEnumerator IEnumerable.GetEnumerator()
   {
      return this.GetEnumerator();
   }
}

This class pulls together pretty much all of the .NET 2.0 features. Firstly, it uses generic constraints to ensure that TClass implements TInterface (or TInterface could be a base class of TClass). Secondly, the compiler understands constraint as it allows us to use yield return to implicitly cast TClass to TInterface. This class then becomes a simple generic wrapper that can be used whenever a covariant list is required:

public static class PeopleFactory
{
   public static IEnumerable<IPerson> CreatePeople()
   {
      // Simulate a data source.
      IList<Person> people = new List<Person>();
      // Person objects added to list.

      // ...
      return new EnumerableGeneric<Person, IPerson>(people);
   }
}

Labels:


Previous Posts