Data quality and consistency are key to the services we support and solutions we deliver. A gremlin which can undermine that is duplicate data. Let’s start the new year dealing with duplicate data and having a good clear-out.
For our example we’ll consider an Order Product table which contains an OrderID and ProductID, and the combination of these should be unique. Other fields for the duplicate records may differ so we may want to be selective about which records are removed.
No feeding after midnight
The first step to dealing with this gremlin is identifying the source of duplication. This needs to be dealt with upfront. Removing duplicates before the root cause is resolved will require further passes, and would delay adding any unique constraints.
To help with this step you may want to interrogate the system functions and views across the instance, or search through source control for references to the table.
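As a rough starting point for that search, something along these lines will look through the definitions of procedures, views, and functions in a database for a mention of our example table (adjust the table name and search pattern for your own schema):

/* Find modules whose definition mentions the table - a rough first pass */
SELECT
    SchemaName = OBJECT_SCHEMA_NAME(m.object_id),
    ObjectName = OBJECT_NAME(m.object_id),
    ObjectType = o.type_desc
FROM sys.sql_modules AS m
JOIN sys.objects AS o
    ON o.object_id = m.object_id
WHERE m.definition LIKE '%OrderProducts%';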
Until that’s resolved, the bad data will keep being created. Once it’s out of the way though, we can get to the meat of the work – cleaning up the duplicates.
Removing the records
With the root cause resolved and confidence that no more duplicates are being added, we’re ready to start removing the duplicate records.
The first step in removing the duplicate records is determining which ones we want to keep. For this we need to understand what defines a unique record (the key), and if we do find a duplicate key, what decides which record we want to keep?
In our example the key would be the combination of OrderID and ProductID.
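Before removing anything it can be worth confirming which keys are actually affected. A simple check like the one below (using our example table) lists each duplicated combination and how many copies exist:

/* List key combinations which appear more than once */
SELECT OrderID, ProductID, Duplicates = COUNT(*)
FROM dbo.OrderProducts
GROUP BY OrderID, ProductID
HAVING COUNT(*) > 1;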
In the simplest case where we have duplicate records which are identical, we can simply use the script below:
WITH Ranking AS (
    SELECT
        RowNum = ROW_NUMBER() OVER (
            PARTITION BY OrderID, ProductID
            ORDER BY (SELECT NULL)
        )
    FROM dbo.OrderProducts
)
DELETE
FROM Ranking
WHERE RowNum > 1;
Here we use the ROW_NUMBER() function, followed by a WHERE clause, to keep the first instance of each key and remove the others.
Now, if the duplicate records have other fields with different values, we may want to choose which to retain. Based on the fields available, we could take a few approaches:
- An identity column on the table could be used to select the earliest / latest record
- A timestamp may be used for the earliest / latest entry
- Another business attribute such as a specific Status, or a larger / smaller Quantity or Value, could be selected
In this instance we’ll use a LastModified timestamp to select the most recently modified record.
With that decision made, we’ll use the same script again with the ORDER BY clause adjusted based on our criteria:
WITH Ranking AS (
    SELECT
        RowNum = ROW_NUMBER() OVER (
            PARTITION BY OrderID, ProductID
            ORDER BY LastModified DESC /* Change as needed */
        )
    FROM dbo.OrderProducts
)
DELETE
FROM Ranking
WHERE RowNum > 1;
You could consider wrapping this into a loop with a TOP (n) clause to batch the deletions. However, due to the partitioning and ordering, if there’s no supporting index a full table scan will be needed each time to apply the ROW_NUMBER() function.
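A minimal sketch of that batching pattern might look like the following. The batch size of 5,000 is an arbitrary choice for illustration, and note the caveat above still applies: each pass re-ranks the table, so a supporting index matters here.

/* Repeat the ranked delete in batches until no more duplicates remain */
DECLARE @BatchSize INT = 5000; /* Arbitrary batch size for illustration */
DECLARE @RowsDeleted INT = 1;

WHILE @RowsDeleted > 0
BEGIN
    WITH Ranking AS (
        SELECT
            RowNum = ROW_NUMBER() OVER (
                PARTITION BY OrderID, ProductID
                ORDER BY LastModified DESC /* Change as needed */
            )
        FROM dbo.OrderProducts
    )
    DELETE TOP (@BatchSize)
    FROM Ranking
    WHERE RowNum > 1;

    SET @RowsDeleted = @@ROWCOUNT;
END;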
To handle larger volumes of deletions or to reduce blocking, we may want to take a different approach…
Recreating the table
In some situations a better approach may be to recreate the table and insert only the unique records. This can be beneficial where a very large portion of the table needs to be removed, or on an active table where you want to reduce blocking.
For this approach, we’ll recreate a copy of the table, add the records we want to keep, and then switch the two tables out. It’ll look something like this:
/* Create a new table with the same schema */
CREATE TABLE dbo.OrderProducts_Dedupe (
    OrderID INT,
    ProductID INT,
    Quantity INT,
    SaleValue MONEY,
    LastModified DATETIME
);
/* Insert the unique records (similar to previously) */
WITH Ranking AS (
    SELECT
        *,
        RowNum = ROW_NUMBER() OVER (
            PARTITION BY OrderID, ProductID
            ORDER BY LastModified DESC /* Change as needed */
        )
    FROM dbo.OrderProducts
)
INSERT INTO dbo.OrderProducts_Dedupe
SELECT OrderID, ProductID, Quantity, SaleValue, LastModified
FROM Ranking
/* Now we only want the 1st record */
WHERE RowNum = 1;
/* Switcheroo */
EXEC sp_rename @objname = 'dbo.OrderProducts', @newname = 'OrderProducts_Old';
EXEC sp_rename @objname = 'dbo.OrderProducts_Dedupe', @newname = 'OrderProducts';
This approach can be more complex if you also have additional objects such as foreign keys or indexes which reference the table, so keep that in mind.
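If you’re not sure what references the table, the catalog views can give a quick answer before the switch. A sketch like the one below (assuming our example table name) lists any incoming foreign keys and the indexes defined on the table:

/* Foreign keys which reference the table we're about to switch out */
SELECT
    ForeignKeyName = fk.name,
    ReferencingTable = OBJECT_NAME(fk.parent_object_id)
FROM sys.foreign_keys AS fk
WHERE fk.referenced_object_id = OBJECT_ID('dbo.OrderProducts');

/* Indexes defined on the table itself */
SELECT
    IndexName = i.name,
    i.type_desc
FROM sys.indexes AS i
WHERE i.object_id = OBJECT_ID('dbo.OrderProducts')
  AND i.name IS NOT NULL;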
Enforcing uniqueness
Whilst the meat of the work is removing the duplicate data, we’ll complete this SQL sandwich by enforcing uniqueness within our data.
This allows us to understand the data better from the schema, and also avoids any possible reoccurrence from other sources (because we fixed the root cause earlier, remember!).
In this example I’d want to apply a unique index to the data. This would demonstrate the link between the two fields and also enforce the uniqueness of that relationship.
As a bonus, the index could also help with query performance when searching or filtering with either of those fields too.
We’d simply create the index as follows:
CREATE UNIQUE NONCLUSTERED INDEX UX_OrderProduct
ON dbo.OrderProducts (
    OrderID,
    ProductID
);
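If you’d rather express this as a unique constraint than an index (either works, as noted in the wrap up below), an equivalent would be along these lines. The constraint name here is just an example, and behind the scenes SQL Server enforces it with a unique index anyway:

/* Alternative: a unique constraint, enforced via a unique index behind the scenes */
ALTER TABLE dbo.OrderProducts
ADD CONSTRAINT UQ_OrderProduct UNIQUE (OrderID, ProductID);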
With that in place, our duplicate data is gone, we’re safe from its return, and we’ve restored integrity to our data.
Wrap up
In this post we’ve looked at two approaches for dealing with duplicate data in a table.
We started out by crucially identifying and stemming the flow of bad data. Once the duplicates were under control we looked at the following ways to clean up the bad data:
- Using a function to identify the unique records and remove the others, also using the ORDER BY clause to handle differing data
- Recreating the table if we’re removing a very large volume of records, or if we wanted to retain the existing version for validation
Finally we stressed the importance of demonstrating the uniqueness of the data through a unique index or constraint. This helps both the database engine and other developers to better understand our schema.
Finding duplicate records in our data can be concerning. Whilst it’s tricky at times to find the source of the duplication, hopefully you’ll now agree that dealing with those duplicates doesn’t need to be.