Blazing Fast (and Accurate) Searches Without an Index

Last week I demonstrated a fast binary search approach to quickly slice through large unindexed tables. I love the approach, but it fell short in two key areas – drop-in usage, and proper boundary handling.

Why

Let’s recap what we’re doing here:

Large append-heavy tables – like logs or audits – often don’t have a useful index on the timestamp. These types of tables do however have a strong correlation between their clustering key and the timestamp due to chronological inserts.

A binary search approach splits the table in half to narrow down the search space with each iteration. By abusing the incremental relationship between the clustering key and timestamps, we can quickly zero in on the point in time we’re after. If you want to see the mechanics, check out last week’s post.

A dynamic proc

This week the function has evolved into a stored proc which uses dynamic SQL to handle to any given table or field names.

Let’s jump right back in with last week’s question – How many posts had a score greater than 100 in January 2014?:

DECLARE @Start BIGINT, @End BIGINT;

EXEC dbo.FastBoundarySearch
	@DatabaseName = 'StackOverflow',
	@SchemaName = 'dbo',
	@TableName = 'Posts',
	@ClusteredFieldName = 'Id',
	@DateTimeFieldName = 'CreationDate',
	@RangeStartDateTime = '2014-01-01',
	@RangeEndDateTime = '2014-02-01',
	@StartRecordID = @Start OUTPUT,
	@EndRecordID = @End OUTPUT;

SELECT COUNT(1)
FROM StackOverflow.dbo.Posts
WHERE Id BETWEEN @Start AND @End
	AND Score > 100;

This proc uses the same binary search approach to zero in on the target timestamp, followed by a minimal range scan to find the exact boundary. It returns the inclusive clustering key boundaries for a given time window, which can then be used to query the table efficiently.

More importantly, this version correctly handles gaps and boundaries which weren’t fully considered last time. NULL values will be returned for invalid or non-existent ranges. This version is just as fast and even more accurate.

In addition there are a couple more parameters:

@MaxScanRange (default 1000) for a tighter or looser narrowing before scanning. Lower (1k-10k) is generally better
@Debug (default 0) to return the SQL statements and pointer positions through the search process

The code is somewhat lengthy so I’ll leave that at the end of the post if you want to jump down and grab a copy 👇

Performance

The core reason for this approach to searching is performance. Large unwieldy tables – particularly unindexed ones – are never fun (or quick) to deal with. For specific time bound searches, this approach is solid.

Recapping the performance figures:

🐢 Clustered index only: 42s CPU, +11m reads
🏃 Nonclustered index + lookups: 3s CPU, ~3m reads
⚡ Dynamic proc: 0.5s CPU, 138k reads

This is for a 40m record table at circa 100gb. With tables which have 1b+ records, you’ll feel the pain.

Covering indexes are ideal, but they’re not common for log or audit tables. A binary search approach has a tiny overhead identifying the ranges. Like a nonclustered seek with lookups, this scales linearly with data volume – but at a fraction of the cost.

Wrap up

In this slightly shorter post we’ve taken the previous binary search approach and elevated it to make it more portable, usable, and accurate – whilst still retaining the same high level of performance.

Where this approach fails is for tables without correlation between the clustering key and timestamp. That relationship is key to its implementation.

Where we do have the correlation – which is typical for append-heavy work such as log tables – a binary search approach is ideal for large, unindexed data sets.

If you haven’t already grabbed the proc below, go and give it a try with the largest table you can find, and see how it performs.

Based on the sad news this week, I’ll admit this isn’t as fast as Chuck Norris’ heap-seek method. That was the advice he gave the optimiser when it asked him. Also, his queries don’t yield after 4ms – everyone else yields indefinitely 🤠

Proc

Finally, the code:

CREATE OR ALTER PROC dbo.FastBoundarySearch (
	@DatabaseName SYSNAME = NULL,
	@SchemaName SYSNAME,
	@TableName SYSNAME,
	@ClusteredFieldName SYSNAME,
	@DateTimeFieldName SYSNAME,
	@RangeStartDateTime DATETIME2(7),
	@RangeEndDateTime DATETIME2(7),
	@MaxScanRange BIGINT = 1000,
	@Debug BIT = 0,
	@StartRecordID BIGINT OUTPUT,
	@EndRecordID BIGINT OUTPUT
) AS
BEGIN

	DECLARE @StartLowerPointer BIGINT, @StartUpperPointer BIGINT, @StartProbeDateTime DATETIME2(7),
		@EndLowerPointer BIGINT, @EndUpperPointer BIGINT, @EndProbeDateTime DATETIME2(7),
		@ProbePointer BIGINT, @ProbeResult BIGINT,
		@DynSql NVARCHAR(MAX), @SourceTable NVARCHAR(800);

	SET @ClusteredFieldName = QUOTENAME(@ClusteredFieldName);
	SET @DateTimeFieldName = QUOTENAME(@DateTimeFieldName);
	SET @SourceTable = CONCAT(
		QUOTENAME(ISNULL(@DatabaseName, DB_NAME())),
		'.', QUOTENAME(@SchemaName),
		'.', QUOTENAME(@TableName));

	/* Outer bounds */
	SET @DynSql = CONCAT('
		SELECT TOP (1) @Pointer = ', @ClusteredFieldName, ', @Timestamp = ', @DateTimeFieldName, '
		FROM ', @SourceTable, '
		ORDER BY ', @ClusteredFieldName, ' ASC;');

	IF (@Debug = 1) SELECT
		Stage = 'Prep',
		DynamicSql = @DynSql;

	EXEC sp_executesql @DynSql,
		N'@Pointer BIGINT OUTPUT, @Timestamp DATETIME2(7) OUTPUT',
		@Pointer = @StartLowerPointer OUTPUT,
		@Timestamp = @StartProbeDateTime OUTPUT;

	SET @DynSql = CONCAT('
		SELECT TOP (1) @Pointer = ', @ClusteredFieldName, ', @Timestamp = ', @DateTimeFieldName, '
		FROM ', @SourceTable, '
		ORDER BY ', @ClusteredFieldName, ' DESC;');

	IF (@Debug = 1) SELECT
		Stage = 'Prep',
		DynamicSql = @DynSql;

	EXEC sp_executesql @DynSql,
		N'@Pointer BIGINT OUTPUT, @Timestamp DATETIME2(7) OUTPUT',
		@Pointer = @EndUpperPointer OUTPUT,
		@Timestamp = @EndProbeDateTime OUTPUT;

	/* Avoid invalid or out of range combinations */
	IF (@StartProbeDateTime IS NULL OR @EndProbeDateTime IS NULL
		OR @RangeStartDateTime > @RangeEndDateTime
		OR @RangeEndDateTime < @StartProbeDateTime
		OR @RangeStartDateTime > @EndProbeDateTime)
		RETURN;

	SET @EndLowerPointer = @StartLowerPointer;
	SET @StartUpperPointer = @EndUpperPointer;
	
	/* Start bounds */
	WHILE (@StartUpperPointer - @StartLowerPointer > @MaxScanRange)
	BEGIN

		SET @ProbePointer = @StartLowerPointer + ((@StartUpperPointer - @StartLowerPointer) / 2);

		SET @DynSql = CONCAT('
			SELECT TOP (1)
				@ProbeResult = ', @ClusteredFieldName, ',
				@ProbeTimestamp = ', @DateTimeFieldName, '
			FROM ', @SourceTable, '
			WHERE ', @ClusteredFieldName, ' >= @ProbePointer
			ORDER BY ', @ClusteredFieldName, ' ASC;');
		
		IF (@Debug = 1) SELECT
			Stage = 'Probe (Start)',
			DynamicSql = @DynSql,
			ProbePointer = @ProbePointer,
			CurrentLower = @StartLowerPointer,
			CurrentUpper = @StartUpperPointer;

		EXEC sp_executesql @DynSql,
			N'@ProbePointer BIGINT, @ProbeResult BIGINT OUTPUT, @ProbeTimestamp DATETIME2(7) OUTPUT',
			@ProbePointer = @ProbePointer,
			@ProbeResult = @ProbeResult OUTPUT,
			@ProbeTimestamp = @StartProbeDateTime OUTPUT;

		/* If theres a large gap, try the other direction */
		IF (@ProbeResult = @StartUpperPointer)
		BEGIN

			SET @DynSql = CONCAT('
				SELECT TOP (1)
					@ProbeResult = ', @ClusteredFieldName, ',
					@ProbeTimestamp = ', @DateTimeFieldName, '
				FROM ', @SourceTable, '
				WHERE ', @ClusteredFieldName, ' < @ProbePointer
				ORDER BY ', @ClusteredFieldName, ' DESC;');
			
			IF (@Debug = 1) SELECT
				Stage = 'Gap (Start)',
				DynamicSql = @DynSql,
				ProbePointer = @ProbePointer,
				CurrentLower = @StartLowerPointer,
				CurrentUpper = @StartUpperPointer;

			EXEC sp_executesql @DynSql,
				N'@ProbePointer BIGINT, @ProbeResult BIGINT OUTPUT, @ProbeTimestamp DATETIME2(7) OUTPUT',
				@ProbePointer = @ProbePointer,
				@ProbeResult = @ProbeResult OUTPUT,
				@ProbeTimestamp = @StartProbeDateTime OUTPUT;

			/* If we have a genuine gap, drop out */
			IF (@ProbeResult = @StartLowerPointer) BREAK;

		END

		IF (@StartProbeDateTime < @RangeStartDateTime)
			SET @StartLowerPointer = @ProbeResult;
		ELSE IF (@StartProbeDateTime >= @RangeStartDateTime)
			SET @StartUpperPointer = @ProbeResult;

	END

	/* End bounds */
	WHILE (@EndUpperPointer - @EndLowerPointer > @MaxScanRange)
	BEGIN

		SET @ProbePointer = @EndLowerPointer + ((@EndUpperPointer - @EndLowerPointer) / 2);

		SET @DynSql = CONCAT('
			SELECT TOP (1)
				@ProbeResult = ', @ClusteredFieldName, ',
				@ProbeTimestamp = ', @DateTimeFieldName, '
			FROM ', @SourceTable, '
			WHERE ', @ClusteredFieldName, ' <= @ProbePointer
			ORDER BY ', @ClusteredFieldName, ' DESC;');
		
		IF (@Debug = 1) SELECT
			Stage = 'Probe (End)',
			DynamicSql = @DynSql,
			ProbePointer = @ProbePointer,
			CurrentLower = @EndLowerPointer,
			CurrentUpper = @EndUpperPointer;

		/* Probe */
		EXEC sp_executesql @DynSql,
			N'@ProbePointer BIGINT, @ProbeResult BIGINT OUTPUT, @ProbeTimestamp DATETIME2(7) OUTPUT',
			@ProbePointer = @ProbePointer,
			@ProbeResult = @ProbeResult OUTPUT,
			@ProbeTimestamp = @EndProbeDateTime OUTPUT;

		/* If theres a large gap, try the other direction */
		IF (@ProbeResult = @EndLowerPointer)
		BEGIN

			SET @DynSql = CONCAT('
				SELECT TOP (1)
					@ProbeResult = ', @ClusteredFieldName, ',
					@ProbeTimestamp = ', @DateTimeFieldName, '
				FROM ', @SourceTable, '
				WHERE ', @ClusteredFieldName, ' > @ProbePointer
				ORDER BY ', @ClusteredFieldName, ' ASC;');
			
			IF (@Debug = 1) SELECT
				Stage = 'Gap (End)',
				DynamicSql = @DynSql,
				ProbePointer = @ProbePointer,
				CurrentLower = @EndLowerPointer,
				CurrentUpper = @EndUpperPointer;

			EXEC sp_executesql @DynSql,
				N'@ProbePointer BIGINT, @ProbeResult BIGINT OUTPUT, @ProbeTimestamp DATETIME2(7) OUTPUT',
				@ProbePointer = @ProbePointer,
				@ProbeResult = @ProbeResult OUTPUT,
				@ProbeTimestamp = @EndProbeDateTime OUTPUT;

			/* If we have a genuine gap, drop out */
			IF (@ProbeResult = @EndUpperPointer) BREAK;

		END

		IF (@EndProbeDateTime < @RangeEndDateTime)
			SET @EndLowerPointer = @ProbeResult;
		ELSE IF (@EndProbeDateTime >= @RangeEndDateTime)
			SET @EndUpperPointer = @ProbeResult;

	END

	/* Perform the final scans */
	SET @DynSql = CONCAT('
		SELECT TOP (1) @RecordID = ', @ClusteredFieldName, '
		FROM ', @SourceTable, '
		WHERE ', @ClusteredFieldName, ' BETWEEN @StartPointer AND @EndPointer
			AND ', @DateTimeFieldName, ' >= @SearchDateTime
		ORDER BY ', @ClusteredFieldName, ' ASC;');

	IF (@Debug = 1) SELECT
		Stage = 'Finalise (Start)',
		DynamicSql = @DynSql,
		LowerPointer = @StartLowerPointer,
		UpperPointer = @StartUpperPointer,
		SearchDateTime = @RangeStartDateTime;

	EXEC sp_executesql @DynSql,
		N'@StartPointer BIGINT, @EndPointer BIGINT, @SearchDateTime DATETIME2(7), @RecordID BIGINT OUTPUT',
		@StartPointer = @StartLowerPointer,
		@EndPointer = @StartUpperPointer,
		@SearchDateTime = @RangeStartDateTime,
		@RecordID = @StartRecordID OUTPUT;

	SET @DynSql = CONCAT('
		SELECT TOP (1) @RecordID = ', @ClusteredFieldName, '
		FROM ', @SourceTable, '
		WHERE ', @ClusteredFieldName, ' BETWEEN @StartPointer AND @EndPointer
			AND ', @DateTimeFieldName, ' <= @SearchDateTime
		ORDER BY ', @ClusteredFieldName, ' DESC;');

	IF (@Debug = 1) SELECT
		Stage = 'Finalise (End)',
		DynamicSql = @DynSql,
		LowerPointer = @EndLowerPointer,
		UpperPointer = @EndUpperPointer,
		SearchDateTime = @RangeEndDateTime;

	EXEC sp_executesql @DynSql,
		N'@StartPointer BIGINT, @EndPointer BIGINT, @SearchDateTime DATETIME2(7), @RecordID BIGINT OUTPUT',
		@StartPointer = @EndLowerPointer,
		@EndPointer = @EndUpperPointer,
		@SearchDateTime = @RangeEndDateTime,
		@RecordID = @EndRecordID OUTPUT;

	IF (@EndRecordID < @StartRecordID)
		SELECT @StartRecordID = NULL, @EndRecordID = NULL;

END
GO

Why

A dynamic proc

Performance

Wrap up

Proc

Share this:

Related

Leave a comment Cancel reply