An Unfortunate Tale of Clustering a Heap

If you’ve ran sp_BlitzIndex before you might have come across some warnings about self-loathing heaps and maybe you’re looking to put a clustering key on them. Generally that’s a good idea to consider but it can cause issues if the data needed to be in that specific order.

In this instance it turned out there was a table where a default value was being selected via TOP 1 in a query which didn’t specify any ordering. This only came to light after a clustering key had been put on the table and folks started to ask about why volumes had moved from one item to another.

The issue was compounded as the unique field which was chosen for clustering was a GUID so that data was in all sorts of random order. The situation started out something like this…

DROP TABLE IF EXISTS #Numbers;
GO

CREATE TABLE #Numbers (
	Id UNIQUEIDENTIFIER DEFAULT(NEWID()),
	Number INT,
	NumberText VARCHAR(10)
);

/* Add some numbers into our table */
INSERT INTO #Numbers (Number, NumberText)
VALUES (-1, 'Unknown'), (0, 'Zero'), (1, 'One'),
	(2, 'Two'),	(3, 'Three'), (4, 'Four'), (5, 'Five');
GO

/* Look at our numbers, very unimpressive */
SELECT *
FROM #Numbers;
GO

To go with this we’ll also create a temporary procedure which will be causing the headache for us:

/* Here's a procedure which had been long forgotten */
CREATE OR ALTER PROCEDURE #GetMyNumber(
	@NumberText VARCHAR(10)
) AS
BEGIN

	DECLARE @NumberId UNIQUEIDENTIFIER;

	SELECT @NumberId = Id
	FROM #Numbers
	WHERE NumberText = @NumberText;

	/* And here is our gremlin waiting to pounce */
	IF (@NumberId IS NULL)
		SELECT TOP 1 @NumberId = Id
		FROM #Numbers;

	SELECT *
	FROM #Numbers
	WHERE Id = @NumberId;

END
GO

With those in place we can see what the ‘normal’ behaviour of the procedure is with a couple of examples – returning the number we search for or defaulting to the unknown record:

/* Running this now will give us record 5 */
EXEC #GetMyNumber @NumberText = 'Five';

/* And running this will get us the -1 record */
EXEC #GetMyNumber @NumberText = 'Eight';

Now many years later this is the point where someone comes to the code fresh and maybe trying to tidy things up, maybe wanting to add a foreign key for a new feature for example and they go and cluster this table:

CREATE UNIQUE CLUSTERED INDEX ClusteredId
ON #Numbers (Id);

/* Now things are looking a little different... */
SELECT *
FROM #Numbers;

What happens if we re-run those procedures from before?

/* We'll still get the result as we're targetting a specific record */
EXEC #GetMyNumber @NumberText = 'Five';

/* And now... it's anyone's guess what your result will be! */
EXEC #GetMyNumber @NumberText = 'Eight';

As a bonus if you run all of the above script sections over and over you’ll get different results for that last procedure each time as different GUIDs are generated each run to make the situation a little more spicy. End users like spicy, honest.

So at this point we’re in a bit of a sticky situation as the table has been shuffled and we can’t tell what shape it was in before and what records should be returned. This is when you might be forced to restore things back to a previous point, or if (as in this tale) you’re looking at transactional data which needs to be maintained then you might want to re-order those records again.

Lets Un-recluster Those Records!

The first thing you’ll want to know is what order those records need to be in. There are a few options here depending on your data:

If its only one record (as in our example) that should be straight forward to know which record is #1
There might be another key which could be used to get things back into the right order (a timestamp maybe)
Restore the database side-by-side and start working through the original table to extract the ordering

Once you know what order you need to get the data back into then its a case of working out a strategy to align those with the existing data. In this example I’ll use the second point above where I’ll sort it based on the Number field which is how we just happened to insert them the first time.

Below you’ll see one potential solution which is to temporarily add a new column to indicate the correct ordering and then based on that re-cluster the table followed by removing the extra column:

/* Add a new column where we can set the right order */
ALTER TABLE #Numbers
ADD OrderingColumn INT;
GO

WITH OrderedNumbers AS (

	/* Add whatever logic you need to get things into the right order */
	SELECT Id, ROW_NUMBER() OVER (ORDER BY Number) [Ordering]
	FROM #Numbers
)
/* Now populate that column */
UPDATE n
SET n.OrderingColumn = o.Ordering
FROM #Numbers n
	INNER JOIN OrderedNumbers o ON n.Id = o.Id;

/* Lets get rid of this nasty cluster */
DROP INDEX ClusteredId ON #Numbers;

/* Create the clustered index to get things in the right order
    and then drop it afterwards to get us back to a heap */
CREATE UNIQUE CLUSTERED INDEX OrderedCluster ON #Numbers (OrderingColumn);
DROP INDEX OrderedCluster ON #Numbers;

ALTER TABLE #Numbers
DROP COLUMN OrderingColumn;

/* Now lets check again to make sure we get Unknown */
EXEC #GetMyNumber @NumberText = 'Eight';

/* Finally we're back in the right order */
SELECT *
FROM #Numbers;

If this heap is an actively used table and there isn’t a window for this type of maintenance then you might want to look at creating this additional column with a default value or a highly seeded identity (similar to when we were removing duplicates from a heap) so you can separate the records you’re working on from the new ones coming in whilst still maintaining ordering. Unfortunately either way you’ll have the pain of dropping, recreating and re-dropping those clustered indexes which might need some serious downtime depending on the volume of data you’re working with.

Heaps are a great tool to have in your belt but when they go bad they don’t half sting.

Lets Un-recluster Those Records!

Share this:

Related

Leave a comment Cancel reply