Back again, Dealing with Storage-aligned Index oddities

I’m back after being on leave for past several weeks in case you wonder why I haven’t posted.  To quote Joy Gresham from the move ShadowLands “I wasn’t dead, I was just in America (fill-in-the-blank)” This was one of my favorite movies that had so many wonderful quotes ( and

Some of you may notice that my academic-research related content has been removed.  I am moving all of that stuff to a non-Microsoft blog since it really isn’t part of my day-to-day work with Microsoft as I am a consultant and not a Microsoft researcher. I’m going to keep this blog focused on my current Microsoft work involving SQL Server tuning and best practices, since after all it is a blog provided to me by Microsoft.

On to my post for today.  I think there is some confusion about how storage-alignment works when it comes to non-clustered indexes (at least I was).  There are some ramifications regarding performance and the types of plans chosen.  There is a significant advantage if you can keep all your indexes associated with a partitioned table storage-aligned.  It greatly simplifies and the automation of archival and sliding window scheme.  It can also help with contention issues.  In one of my applications where I was trying to parallelize my application to use different threads to perform simulation on a partitioning key (pardon the example from my research work…), I ran into deadlocks and all sorts of locking problems even though each thread was granulized to a filter to a different partition and I had set the locking escalation for partition rather than table.   The problem was the non-clustered indexes that were not storage aligned were getting escalated to range locks for sequential scans.

One of the reasons, I was using non-partitioned indexes was to help with performance as there were certain operations that were much slower when the partitioning column was associated with the index, Certain types of queries including join activity can be more optimal when using surrogates in other tables without the added weight of the partitioned column being propagated and included in all the joins.

I must confess to a previous fundamental misunderstanding of how partition alignment really works.  I was thinking that all indexes that were storage-aligned to a partitioned table were actually using the partitioned column as a hidden first segment on the alternate indexes to ensure perfect storage alignment.  However, that is not the case, if an index is created storage aligned, a separate b-tree is created at the index root level that got each partition range (not each partition value, that would be a nightmare).  In other words, if you have 100 different partitions based on a partitioning key and storage align on it (even without specifying the partition column as part of the aligned index), you will end up with 100 individual b-trees – 1 corresponding to each partition – each one will use an include column to store the partition. 

So, this architecture resolves one of my concerns, since seeks should be able to operate optimally even without filtering on the partitioned column, albeit with the overhead of visiting one b-tree per partition.   Storage alignment doesn’t mean alignment at the row level or even page level, but simply at the partition level.  For a great explanation of all this, see

There’s another good article at

To quote from the Hanspo:

“Partitioning an index transforms the single B-tree of a non-partitioned index into several smaller B-trees, one for each partition defined by the partition function, where each B-tree only consists of index rows where the partitioning column falls into the partition’s range. Therefore, the partitioning column always has to be part of a partitioned index.

Splitting a partitioned index into several independent B-trees obviously also has consequences for index seeks. As long as the partitioning column is one of the search predicates, the optimizer knows which of the trees to choose for the index seek. But if the filter criteria do not include the partitioning column, the optimizer would not know where to start the index seek or scan when using a partitioned index. E.g. if we create the following composite partitioned clustered index “CREATE CLUSTERED INDEX IX_Tab ON Tab (a, b)”, where table Tab is being partitioned on column b, the query “SELECT * FROM Tab WHERE a = @x” cannot be performed as a simple clustered index seek, like in the non-partitioned case, but each of the index trees would have to be traversed separately to find the qualifying records. As this example shows, for partitioned tables the rule, that the table rows are sorted by the clustered index keys, does not necessarily hold true for the whole table, but only for single partitions. Therefore, it’s always good practice to include the partitioning column in your search predicates, even if it does not affect the result set.”

From this, obviously it is more optimal if we partition on a column that is often used to filter, but do we need to redesign data structures to work more optimally with partitioning by replicating partitioned columns onto joined tables or have to use non-storage aligned indexes to get around the performance issues when filtering is not available?  Not necessarily. 

There are some tricks.  There is some behavior that seems baffling as far as query plans chosen for certain operations.  One example, is the case of a select max(non-partitioned index column) from a storage-aligned non-clustered index; Even with just a few partitions, the query optimizer seems to prefer an index scan rather than seeking on each of the b-trees.  The index scan may make sense if there are lots of little partitions, but in the case of a few big partitions it is sub-optimal, particularly when considering the near-zero latency of Solid State storage to support fast random-seeks.

Recently a colleague of mine ran into this issue on our SQL Master blog and since I’ve been dealing with it, decided to try some of the suggestions that came up and play with this some more.  Among the suggestion was the technique of a CTE with a group-by to force seeks within each partition.  An example of this is:

with localmax
as (select MAX(equityintradayid) as maxId from dbo.EquityIntraday group by MarketDate)select MAX(maxId)
from localmax

However in my scenario, I got a less than optimal query plan still doing an index scan and didn’t try making the efforts to clean it up.

The thread is still continuing and not all options have been explored, but at this point, my experience is that getting the max value of a non-clustered index column that is storage-aligned is one of the few things handled better the old fashion-way, rather than relying on the query optimizer. 

Using a simple select max(id) from a table with 79 million rows with a table partitioned on a date field and only 18 active partitions generated by default an index scan and took over 13 seconds of CPU and over 2 seconds elapsed on a dual XEON-Quad-core with 2 Fusion-IO SLC cards.  However, changing this to use a custom function improved the performance by 1000 times.  This highlights the significance of SSD storage for random access.  I have not tested this on a non-Fusion device.  My guess is that the performance would still be much better using the binary-search method, but the improvement is probably not nearly as dramatic.

Below contrasts the difference in results:


So, what does the function look like that provides the improvement.  Actually very simple, and easy to adapt to other scenarios.  Note, I have only tested this with a couple of tables, so have not tested all the boundary conditions, in case you try this and it gets stuck in an endless loop or returns the wrong result.

In a few days, I plan to post a stored procedure that receives a table and column name which will then perform the same type of logic so that this can be handled generically.

CREATE FUNCTION [dbo].[udf_GetTopIntradayId]






    -- Bob Leithiser - Sample Only.  Disclaimer: 

    /*    Works so far, but boundary conditions not completely tested.

        May go into an endless loop or return wrong result.


    -- Uses binary search to find the top EquityIntradayId 

-- Typicall much faster than simply doing a select max for

    -- the scenario of finding a max value on an non-clustered

    -- storage-aligned index where no filtering is done on 

    -- the partitioned column.  To use for another table, 

    -- do global search/replace of the table name and desired

    -- column.


    DECLARE @EquityIntradayId INT = NULL

    -- Decrement down until we don't get a match

    DECLARE @Iteration FLOAT = 31.00

    WHILE @Iteration > 0


        DECLARE @HighRange INT = POWER(2.00,@Iteration)-1.00

        DECLARE @LowRange INT = POWER(2.00,@Iteration-1.00)


        IF NOT EXISTS (SELECT TOP 1 0 FROM dbo.EquityIntraday h

                        WHERE h.EquityIntradayId > @LowRange)

            SET @Iteration = @Iteration - 1





    -- Search the range

    SET @HighRange = (@HighRange + @LowRange) / 2.00

    WHILE @LowRange <> @HighRange 


        IF EXISTS (SELECT TOP 1 0 FROM dbo.EquityIntraday 

                    WHERE EquityIntradayId >= (@LowRange + @HighRange) / 2.00)

            SET @LowRange = (@HighRange + @LowRange) / 2.00


            SET @HighRange = (@HighRange + @LowRange) / 2.00


    RETURN @HighRange  



That’s all I have for now.  Thanks for reading.  Remember I am very slow to respond on this blog, too much going on in my life, so this is sort of a 1-way brain-dump for me to contribute to the community, rather than in interactive blog.   If you are interested in my work (Microsoft or Academic-related), you can look for Bob Leithiser on linked-in and send me a request.  My thanks to Bob Duffy, Dirk Gubbels, Robert Davis, Assaf Fraenkel for much of the material for this post.  These guys are all Microsoft Certified SQL Masters and the type of people you want on your team for serious SQL Server stuff.


This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s