Removing Duplicate Records from a Heap

Dealing with duplicate records can be a headache when you’re expecting them to be unique and at some point the application or report users start to question why they’re seeing the same customer or product appearing twice. Once you’ve tracked them down you typically want to get rid of them and sometimes you’ll find an unfortunate situation where they’re in a heap without any sort of unique keys in place to help us remove (or keep) exactly the records we like.

In this case we’d actually stumbled upon it by accident a while back when performing an application upgrade and trying to apply a unique index found we’d got a few duplicate records which had cropped up over the years. Fortunately these hadn’t been causing any issues but we wanted to get things straightened out anyway.

Below is the set up for this example:

/* Drop and recreate the table for our testing */
DROP TABLE IF EXISTS dbo.DoubleTrouble;

CREATE TABLE dbo.DoubleTrouble (
	StoreID INT,
	ProductID INT,
	LastSoldDateID INT
);

/* Pop some 'real' data in there */
INSERT INTO dbo.DoubleTrouble (StoreID, ProductID, LastSoldDateID)
VALUES (1, 1, 20211101),
	(1, 2, 20210816),
	(1, 3, 20210912),
	(1, 4, 20201203),
	(1, 5, 20211108);

/* Oh noes, duplicates appeared! */
INSERT INTO dbo.DoubleTrouble (StoreID, ProductID, LastSoldDateID)
VALUES (1, 2, 20210816), (1, 4, 20201203);

/* Some time later... */

/* We want to try and create a unique index
	on the data which _should_ be feasbile... */
CREATE UNIQUE INDEX ux_Store_Product
ON dbo.DoubleTrouble (
	StoreID, ProductID
);

/* Well that just didn't work... */

/* Lets find those duplicate records */
SELECT StoreID, ProductID
FROM dbo.DoubleTrouble
GROUP BY StoreID, ProductID
HAVING COUNT(*) > 1;

/* So what if we want to delete one of the records?
	Well they're duplicates so we'd end up deleting both! */
SELECT *
FROM dbo.DoubleTrouble
WHERE StoreID = 1 AND ProductID = 2;

So now we’ve got duplicates we’d like to get rid of but there’s nothing to uniquely identify them. If you’ve ever tried to remove them via the SSMS UI then you’ll have have been greeted with an error something like this:

Below are a couple of approaches which I’d consider if I ran into this situation again. Each will have their merits and may depend on how active the database is and how many records there are in the table versus the number which need to be removed.

Creating a Clean Table

The first approach I’d consider is where we’ll create a new table and copy a good clean copy of the records in followed by dropping the original table and then renaming the new one in its place. You’d likely want to do this during some downtime on the system as you wouldn’t want any new records popping in after you’ve taken the copy into the new table and you might also get caught out when dropping the original table too.

/* Create the new table to store the unique records */
CREATE TABLE dbo.LessTroubly (
	StoreID INT,
	ProductID INT,
	LastSoldDateID INT
);

/* Create the index ahead of time to avoid bad data getting in */
CREATE UNIQUE INDEX ux_Store_Product
ON dbo.LessTroubly (
	StoreID, ProductID
);

/* Insert a distinct list of records */
INSERT INTO dbo.LessTroubly (
	StoreID,
	ProductID,
	LastSoldDateID
)
SELECT
	DISTINCT
	StoreID,
	ProductID,
	LastSoldDateID
FROM
	dbo.DoubleTrouble;

/* Now drop the old table and put the new one in its place */
DROP TABLE dbo.DoubleTrouble;
EXEC sp_rename @objname = 'dbo.LessTroubly', @newname = 'DoubleTrouble';

Deleting from the Existing Table

An alternative approach would be to look at deleting the records from the source table which – if done carefully – may help avoid the downtime required with the method above. Additionally as this uses a temporary table to store the records we want to keep this could be adjusted to delete the data incrementally if you’ve really got a lot of duplicates to be working through.

/* Add a new identity field into the table which will auto populate for us */
ALTER TABLE dbo.DoubleTrouble
ADD TempIdentity INT IDENTITY(1, 1);

/* It also won't necessarily break inserts either */
INSERT INTO dbo.DoubleTrouble (StoreID, ProductID, LastSoldDateID)
VALUES (1, 3, 20210912), (1, 5, 20211108);

/* Lets create a table to store the records to keep */
DROP TABLE IF EXISTS #KeyRecords;

CREATE TABLE #KeyRecords (
	StoreID INT,
	ProductID INT,
	LastSoldDateID INT,
	KeyRecord INT
);

INSERT INTO #KeyRecords (
	StoreID,
	ProductID,
	LastSoldDateID,
	KeyRecord
)
SELECT
	StoreID,
	ProductID,
	LastSoldDateID,
	MIN(TempIdentity)
FROM
	dbo.DoubleTrouble
GROUP BY
	StoreID,
	ProductID,
	LastSoldDateID;

/* Now remove the records which arent our key ones */
DELETE dt
FROM dbo.DoubleTrouble dt
	INNER JOIN #KeyRecords kr ON dt.StoreID = kr.StoreID
		AND dt.ProductID = kr.ProductID
		AND dt.LastSoldDateID = kr.LastSoldDateID
WHERE dt.TempIdentity <> kr.KeyRecord;

/* We can now apply our unique index to the table */
CREATE UNIQUE INDEX ux_Store_Product
ON dbo.DoubleTrouble (
	StoreID, ProductID
);

/* Now we can tidy up and remove the column and temp table */
ALTER TABLE dbo.DoubleTrouble DROP COLUMN TempIdentity;
DROP TABLE #KeyRecords;

With some relatively small changes both of these methods above can be adjusted so that this could help weed out duplicate key records which don’t cover all fields. For example in this case if the Last Sold Date was different for each records and only the Store and Product combinations were the key, we could adjust the logic to pick the latest Last Sold Date and insert those records into our new table or temporary table.

One reply on “Removing Duplicate Records from a Heap”

[…] this additional column with a default value or a highly seeded identity (similar to when we were removing duplicates from a heap) so you can separate the records you’re working on from the new ones coming in whilst still […]

Creating a Clean Table

Deleting from the Existing Table

Share this:

Related

One reply on “Removing Duplicate Records from a Heap”

Leave a comment Cancel reply