SQL Practice Problem

Delete duplicate rows keeping one

Hard Analytics Patterns

Schema and sample data

Contacts
idnameemail
1Alice Wongalice@example.com
2Bob Diazbob@example.com
3Alice Wongalice@example.com
4Cara Leecara@example.com
5Bob Diazbob@example.com
6Alice Wongalice@example.com
7Cara Leecara@example.com
Show setup SQL (copy to run)
CREATE TABLE Contacts (
    id    INT PRIMARY KEY,
    name  VARCHAR(50) NOT NULL,
    email VARCHAR(100) NOT NULL
);

INSERT INTO Contacts (id, name, email) VALUES
(1, 'Alice Wong', 'alice@example.com'),
(2, 'Bob Diaz',   'bob@example.com'),
(3, 'Alice Wong', 'alice@example.com'),
(4, 'Cara Lee',   'cara@example.com'),
(5, 'Bob Diaz',   'bob@example.com'),
(6, 'Alice Wong', 'alice@example.com'),
(7, 'Cara Lee',   'cara@example.com');
Your task

The Contacts table has accumulated duplicate rows: the same name and email pair was inserted several times, each time getting a new id. Write a statement that deletes the duplicates so that exactly one row survives for each distinct (name, email) pair. Keep the copy with the smallest id and remove the rest.

The result shown below is the contents of the table after the delete, ordered by id.

Expected result

Your query should return
idnameemail
1Alice Wongalice@example.com
2Bob Diazbob@example.com
4Cara Leecara@example.com
Show hint

You cannot tell two identical rows apart with a plain WHERE, so number the copies instead. Assign a running number within each (name, email) group with ROW_NUMBER() OVER (PARTITION BY name, email ORDER BY id). The row you want to keep gets number 1; everything with a number above 1 is a duplicate you can delete.

Try to write the query yourself before you open this.

Show solution and explanation
BEGIN TRANSACTION;

WITH ranked AS (
    SELECT
        id,
        ROW_NUMBER() OVER (
            PARTITION BY name, email
            ORDER BY id
        ) AS rn
    FROM Contacts
)
DELETE FROM ranked
WHERE rn > 1;

COMMIT TRANSACTION;

-- Inspect the survivors
SELECT id, name, email
FROM Contacts
ORDER BY id;

The core idea is to give every row a position within its own duplicate group. PARTITION BY name, email starts the numbering over for each distinct pair, and ORDER BY id decides which copy is number 1. Because ids are ascending, the smallest id in each group always receives rn = 1 and every later copy gets 2, 3, and so on. See ROW_NUMBER() for how the ranking is built.

In SQL Server you can DELETE directly from the common table expression: DELETE FROM ranked WHERE rn > 1 removes exactly the rows the CTE tagged as extras, and the change flows back to the base Contacts table. With the sample data, ids 3 and 6 (Alice), 5 (Bob), and 7 (Cara) are deleted, leaving ids 1, 2, and 4.

Choosing which copy to keep is a real decision, not just a detail. Here we keep the smallest id (the oldest record); flip to ORDER BY id DESC to keep the newest, or order by any other column if that is your tie breaker. Whatever you pick, wrap the operation in a transaction as shown so you can run the SELECT first, confirm the row count, and ROLLBACK instead of COMMIT if the numbers look wrong. A delete without a defined tie breaker or a safety net is the classic way to lose the wrong rows.

Before deleting, it is worth confirming the duplicates exist the way you expect - see the scenario walkthrough for the matching "find the duplicates" query. Once the table is clean, a UNIQUE constraint on (name, email) stops the problem from coming back.

Keep practising

Work through more Hard exercises, or test yourself with the SQL interview questions.