Home › SQL Interview Questions › Intermediate

SQL Interview Prep

Intermediate SQL Interview Questions

Q: What is the difference between a subquery and a join, and when do you use each?

A subquery is a query nested inside another query, used for scalar values, existence tests with IN or EXISTS, or as a derived table. A join combines tables side by side to return columns from each. Use a join to pull columns from multiple tables, and a subquery for existence or aggregate tests.

Q: What is a common table expression (CTE)?

A CTE is a named temporary result set defined with a WITH clause that exists only for the single statement that follows it. It improves readability over nested subqueries, can be referenced multiple times, and can be recursive.

Q: What is a window function and how is it different from GROUP BY?

A window function computes a value across a set of related rows (the window) defined by an OVER clause, but keeps every input row in the result. GROUP BY instead collapses each group into a single summary row, so you lose the detail.

Q: What is the difference between ROW_NUMBER, RANK and DENSE_RANK?

All three assign a number to each row over an ordered partition. ROW_NUMBER is always unique (1,2,3,4). RANK gives ties the same number then skips values, leaving gaps (1,1,3). DENSE_RANK gives ties the same number but never skips, so there are no gaps (1,1,2).

Q: What is a correlated subquery?

A correlated subquery is an inner query that references a column from the outer query, so it cannot run on its own and is conceptually evaluated once per row of the outer query. It is common with EXISTS and per-row aggregates, and can often be rewritten as a join or window function for better performance.

Q: What do the set operators UNION, INTERSECT and EXCEPT do?

They combine two result sets with matching columns: UNION returns all rows from both with duplicates removed, INTERSECT returns rows found in both, and EXCEPT returns rows in the first query but not the second. All three remove duplicates by default and treat NULLs as equal.

Q: What is a view and what are its benefits and limits?

A view is a named virtual table defined by a stored SELECT statement. It simplifies complex queries, gives a stable interface, and can restrict columns for security, but a standard view stores no data and reruns its query every time unless it is an indexed or materialized view.

Q: What is an index and how does it speed up queries?

An index is a sorted data structure, usually a B-tree, that lets the engine seek directly to the rows it needs instead of scanning the whole table. It makes matching reads much faster, at the cost of slower writes and extra storage.

Q: What is the difference between a clustered and a non-clustered index?

A clustered index defines the physical order of the table itself, so its leaf level is the actual data rows; there is only one per table. A non-clustered index is a separate structure whose leaf holds the key plus a pointer back to the row, so you can have many.

Q: What is a transaction and what does ACID mean?

A transaction is a single unit of work that either fully commits or fully rolls back, so the database is never left half changed. ACID stands for Atomicity, Consistency, Isolation, and Durability, the four guarantees a reliable transaction provides.

Once the basics are covered, interviewers test whether you understand the trickier distinctions. These intermediate questions cover subqueries, CTEs, window functions, set operators and the classic "what is the difference between X and Y" pairs, each with a diagram and SQL.

Intermediate 15 questions

1 What is the difference between a subquery and a join, and when do you use each?

A subquery is a complete SELECT nested inside another statement. It comes in a few shapes: a scalar subquery returns one row and one column and can sit anywhere a single value is allowed; a multi-row subquery feeds operators like IN or EXISTS; and a derived table is a subquery in the FROM clause that you query as if it were a table.

A join works differently: it places two or more tables side by side and matches their rows on a related column, so the result set can carry columns from every table at once. That is the key contrast - a join is built to combine columns, while a subquery is often used to filter or test the outer query.

Many subqueries can be rewritten as joins and the other way around. In SQL Server the query optimizer frequently flattens an IN or EXISTS subquery into a semi join internally, so the two forms often produce the same execution plan and similar performance. The choice is then mostly about clarity.

Prefer a join when you need columns from several tables in the output, since it reads naturally and avoids repeating logic.
Prefer a subquery for pure existence or aggregate tests (does a matching row exist, is a value above the average) where you do not want the extra columns and do not want a join to duplicate outer rows.
Use a derived table when you need to pre-aggregate before joining.

One trap with scalar subqueries: if a subquery used where a single value is expected returns more than one row, SQL Server raises error 512 (subquery returned more than 1 value). A join would instead return the extra rows, so understanding this difference in behavior is part of choosing the right tool.

Subquery with IN: customers who have placed an order

SELECT c.name
FROM Customers AS c
WHERE c.customer_id IN (
    SELECT o.customer_id
    FROM Orders AS o
);

The inner query returns a set of customer ids and the outer query keeps customers whose id is in that set.

Equivalent JOIN: same result, columns available from both tables

SELECT DISTINCT c.name
FROM Customers AS c
INNER JOIN Orders AS o
    ON o.customer_id = c.customer_id;

The join matches the tables directly; DISTINCT removes duplicates since a customer with many orders would otherwise repeat.

Follow-up questions to expect

When would EXISTS be better than IN for a subquery?
What is a correlated subquery and how does it affect performance?
Why might the equivalent JOIN need DISTINCT when the IN subquery does not?
How does a derived table differ from a common table expression (CTE)?

2 What is a common table expression (CTE)?

A common table expression, or CTE, is a named temporary result set that you define with a WITH clause at the top of a query. It exists only for the duration of the single statement that immediately follows it, then disappears. Think of it as a query-scoped alias for a subquery that you can refer to by name.

The main reason to use one is readability. Instead of burying logic in a deeply nested subquery, you name each step and read the query top to bottom. Because the CTE has a name, you can also reference it more than once in the same statement without repeating the SQL, which a plain derived table in the FROM clause cannot do.

Scope: a CTE lives for exactly one statement. A view, by contrast, is a saved schema object that any query can use later.
Chaining: you can define several CTEs separated by commas, and a later one can build on an earlier one.
Recursion: a CTE can refer to itself, which lets you walk hierarchies such as an employee-manager tree or a bill of materials. See recursive CTEs for the anchor-plus-recursive pattern.

An important and often-tested detail: a CTE is not materialized or cached by default in SQL Server. The optimizer inlines its definition into the outer query, so referencing a CTE twice can cause its underlying query to be evaluated twice. A CTE does not by itself make a query faster; it is a naming and structuring tool. If you genuinely need to compute a result once and reuse it, store it in a temp table or table variable instead.

Compared with a derived table (a subquery in the FROM clause), a CTE is functionally similar but sits above the main query and can be named and reused. Compared with a view, a CTE is disposable and local to one statement, whereas a view is persisted metadata that also serves as an access-control and abstraction layer. Microsoft documents the full syntax in the WITH common_table_expression reference.

Simple WITH: filter to high-value orders, then aggregate

WITH BigOrders AS (
    SELECT customer_id, total
    FROM Orders
    WHERE total > 1000
)
SELECT customer_id, COUNT(*) AS big_order_count, SUM(total) AS big_total
FROM BigOrders
GROUP BY customer_id;

The BigOrders CTE names the filtered set once, and the main query reads cleanly against that name.

Multi-CTE chain: each step builds on the previous one

WITH OrderTotals AS (
    SELECT customer_id, SUM(total) AS lifetime_value
    FROM Orders
    GROUP BY customer_id
),
Ranked AS (
    SELECT customer_id, lifetime_value,
           RANK() OVER (ORDER BY lifetime_value DESC) AS value_rank
    FROM OrderTotals
)
SELECT customer_id, lifetime_value
FROM Ranked
WHERE value_rank <= 10;

Two CTEs separated by a comma: Ranked reads from OrderTotals, so the logic flows top to bottom instead of nesting.

Follow-up questions to expect

Does a CTE improve performance compared to a subquery?
What is the difference between a CTE and a temporary table?
How do you write a recursive CTE, and what stops it looping forever?
When would you use a view instead of a CTE?

3 What is a window function and how is it different from GROUP BY?

A window function performs a calculation across a set of rows that are related to the current row, called the window, and returns a value for every input row. You attach it to an aggregate or ranking function with the OVER clause: OVER(PARTITION BY ... ORDER BY ...). The PARTITION BY splits the rows into groups, and the optional ORDER BY orders the rows inside each partition (which matters for ranking, running totals and offsets).

The key contrast with GROUP BY is what happens to the rows. GROUP BY collapses each group into one summary row, so the individual detail rows disappear and you can only select grouped columns and aggregates. A window function keeps every row and simply adds the computed value as an extra column, so you can show a detail row and its group total side by side without a self join or subquery.

Window functions fall into three families:

Ranking: ROW_NUMBER, RANK and DENSE_RANK number or rank rows within a partition. See the ROW_NUMBER function reference.
Aggregate windows: SUM, AVG, COUNT, MIN and MAX used with OVER to produce group totals or running totals alongside each row.
Offset: LAG and LEAD read a value from a previous or following row in the ordered partition, which is ideal for period over period comparisons.

A window function always runs after the WHERE, GROUP BY and HAVING steps, which is why you cannot filter directly on one in a WHERE clause; you wrap the query in a CTE or subquery and filter the outer query instead. For a fuller tour with more patterns, see the SQL window functions guide.

Window aggregate: keep every row and show its department total

SELECT
    emp_id,
    department,
    salary,
    SUM(salary) OVER (PARTITION BY department) AS dept_total
FROM Employees;

Every employee row is returned, with the department total repeated on each row. No detail is lost.

GROUP BY: collapse each department into one summary row

SELECT
    department,
    SUM(salary) AS dept_total
FROM Employees
GROUP BY department;

One row per department only. The individual emp_id and salary values are gone because the rows are collapsed.

Follow-up questions to expect

What is the difference between ROW_NUMBER, RANK and DENSE_RANK?
How would you compute a running total with a window function?
Why can you not use a window function in a WHERE clause, and how do you work around it?
What does PARTITION BY do compared to GROUP BY?

4 What is the difference between ROW_NUMBER, RANK and DENSE_RANK?

ROW_NUMBER, RANK and DENSE_RANK are all window functions that number rows within an ordered group. They share the same syntax, an OVER (PARTITION BY ... ORDER BY ...) clause, so the only thing that changes is how each one treats ties (rows with equal values in the ORDER BY).

ROW_NUMBER assigns a strictly unique number to every row: 1, 2, 3, 4. Even when two rows tie, it breaks the tie arbitrarily and never repeats a number.
RANK gives tied rows the same number, then skips the numbers it used up. Two rows tied for first both get 1, and the next row gets 3, so the sequence is 1, 1, 3 with a gap.
DENSE_RANK also gives tied rows the same number but does not skip. Two rows tied for first get 1 and the next row gets 2, producing 1, 1, 2 with no gaps.

A simple way to remember it: RANK counts positions (like Olympic medals, where two silvers mean no bronze), while DENSE_RANK counts distinct values. ROW_NUMBER ignores ties entirely and just hands out a running counter.

Which one to pick depends on the goal. Use ROW_NUMBER when you need exactly one row per group, for example the single latest order per customer or a deterministic page of results. Use RANK or DENSE_RANK when tied rows should genuinely share a placing, such as leaderboards or nth-highest queries; choose DENSE_RANK when you want the nth distinct value and RANK when the gaps are meaningful.

Because ROW_NUMBER is non-deterministic across ties, always add a tie-breaker column to the ORDER BY if you need repeatable output.

All three side by side over the same ordering

SELECT
    name,
    salary,
    ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn,
    RANK()       OVER (ORDER BY salary DESC) AS rnk,
    DENSE_RANK() OVER (ORDER BY salary DESC) AS drnk
FROM Employees;

For salaries 5000, 5000, 4000 you get rn = 1,2,3 (unique), rnk = 1,1,3 (gap after the tie) and drnk = 1,1,2 (no gap).

Top row per group with ROW_NUMBER

WITH ranked AS (
    SELECT
        department_id,
        name,
        salary,
        ROW_NUMBER() OVER (
            PARTITION BY department_id
            ORDER BY salary DESC, emp_id
        ) AS rn
    FROM Employees
)
SELECT department_id, name, salary
FROM ranked
WHERE rn = 1;

ROW_NUMBER guarantees exactly one row per department. The extra ORDER BY emp_id is a tie-breaker so the result is deterministic.

Follow-up questions to expect

How would you return the second highest salary using one of these functions?
Why is ROW_NUMBER non-deterministic without a tie-breaker in the ORDER BY?
When would DENSE_RANK give a different nth-highest result than RANK?
Can you use RANK in a WHERE clause directly, and if not, how do you filter on it?

5 What is a correlated subquery?

A correlated subquery is a subquery that depends on the outer query: it references a column from the outer query in its WHERE clause, so it cannot be executed on its own. Because the inner query needs a value from the current outer row, it is conceptually evaluated once for every row the outer query produces. This is the key difference from a plain (non-correlated) subquery, which is self contained and can be run once and reused.

You will most often see correlated subqueries in two shapes:

Existence checks with EXISTS or NOT EXISTS, where the subquery asks "is there a matching row for this outer row?" and stops as soon as one is found.
Per-row aggregates, where the subquery computes something like a count, sum, or max that is scoped to the current outer row (for example, each customer compared against their own average order).

The trade off is performance. That row by row evaluation can be slow on large tables, so the optimizer often has to work harder than it would for a set based join. In many cases the same result is better expressed as a JOIN, or as a window function that computes the per-row value in a single pass. Modern query optimizers can sometimes flatten a correlated subquery into a join automatically, but writing the clearer set based version yourself gives you more predictable plans.

Note that EXISTS is a good reason to keep a correlated subquery: it short circuits on the first match and does not care about duplicates, which often makes it both readable and fast. The rule of thumb is to reach for a correlated subquery for existence tests, and to prefer a join or window function when you need to return or aggregate data across rows. See the EXISTS documentation for details.

Correlated EXISTS: customers who have placed at least one order

SELECT c.customer_id, c.name
FROM Customers AS c
WHERE EXISTS (
    SELECT 1
    FROM Orders AS o
    WHERE o.customer_id = c.customer_id
);

The inner query references c.customer_id from the outer row, so it is correlated. EXISTS stops at the first matching order instead of counting them all.

Per-row aggregate vs a window function rewrite

-- Correlated subquery: orders above that customer's average
SELECT o.order_id, o.customer_id, o.total
FROM Orders AS o
WHERE o.total > (
    SELECT AVG(o2.total)
    FROM Orders AS o2
    WHERE o2.customer_id = o.customer_id
);

-- Same idea with a window function (one pass, no per-row subquery)
SELECT order_id, customer_id, total
FROM (
    SELECT order_id, customer_id, total,
           AVG(total) OVER (PARTITION BY customer_id) AS avg_total
    FROM Orders
) AS x
WHERE total > avg_total;

The correlated version recomputes the average per outer row; the window version computes each customer average once with AVG() OVER (PARTITION BY ...).

Follow-up questions to expect

When is a correlated subquery faster than a join, and when is it slower?
How does EXISTS differ from IN when the subquery can return NULLs?
How would you rewrite a per-row aggregate subquery using a window function?
Can the query optimizer turn a correlated subquery into a join automatically?

6 What do the set operators UNION, INTERSECT and EXCEPT do?

The set operators combine the output of two SELECT statements vertically, stacking one result set on top of the other rather than joining them side by side. To use them the two queries must return the same number of columns, in the same order, with compatible data types. The column names of the final result come from the first query.

UNION returns every distinct row that appears in either query. Duplicate rows are removed, so it does an implicit sort or hash to deduplicate. Use UNION ALL when you want to keep duplicates and skip that work, which is faster.
INTERSECT returns only the distinct rows that appear in both result sets.
EXCEPT returns the distinct rows from the first query that do not appear in the second query. It is the SQL Server spelling of the relational difference; some databases call it MINUS.

A key detail interviewers listen for: INTERSECT and EXCEPT, like UNION, remove duplicates and compare whole rows, and they treat two NULL values as equal. This is different from a regular comparison such as =, where NULL = NULL is unknown rather than true, which makes these operators handy for comparing two data sets that may contain NULLs.

EXCEPT is a clean way to answer "which rows are missing" without writing an outer join and an IS NULL filter. It is closely related to an anti join, and for keeping duplicates you should understand UNION versus UNION ALL. See the Microsoft docs on EXCEPT and INTERSECT for the full rules.

Precedence matters when you mix operators in one statement: INTERSECT binds tighter than UNION and EXCEPT, so wrap parts in parentheses when the intended order is not obvious. An ORDER BY can appear only once, at the very end, and it sorts the combined result.

INTERSECT: customers who are also employees

SELECT email FROM Customers
INTERSECT
SELECT email FROM Employees
ORDER BY email;

Returns only the email values present in both tables, with duplicates removed and NULLs matched as equal.

EXCEPT: products that have never been ordered

SELECT product_id FROM Products
EXCEPT
SELECT product_id FROM OrderItems;

Returns product_id values in Products that do not appear in OrderItems, a concise way to find missing rows without a LEFT JOIN and IS NULL check.

Follow-up questions to expect

What is the difference between UNION and UNION ALL, and which is faster?
How would you rewrite an EXCEPT query using a LEFT JOIN or NOT EXISTS?
How do the set operators handle NULL values compared to a normal equality test?
When you mix UNION and INTERSECT in one statement, which is evaluated first?

7 What is a view and what are its benefits and limits?

A view is a named, virtual table defined by a stored SELECT statement. It has no data of its own: when you query a view, the database expands its definition and runs the underlying query against the base tables, then returns the result as if it were a table. You can read more in SQL views explained.

Views deliver three main benefits:

Simplify complex queries. A multi-table join with aggregation can be wrapped in a view, so callers write SELECT * FROM SalesSummary instead of repeating the whole query.
Provide a stable interface. The view definition can change (new joins, renamed columns) while the view name and output columns stay the same, insulating applications from schema churn.
Restrict columns for security. You can grant access to a view that exposes only safe columns or rows and revoke access to the base tables, so users never see sensitive data.

The key limits to mention are:

Not materialized. A standard view stores no data and reruns the underlying query on every access, so it does not speed anything up by itself. An indexed view in SQL Server (a materialized view in other engines) does persist the result and can be maintained automatically, at the cost of storage and slower writes.
Updatable views have restrictions. You can INSERT, UPDATE, or DELETE through a view only when it maps cleanly to one base table. Views using JOIN, GROUP BY, DISTINCT, aggregates, or UNION are generally read only unless you add an INSTEAD OF trigger.

Because a view is just a saved query, it is closely related to a CTE: a CTE is scoped to a single statement, while a view is a reusable database object.

CREATE VIEW: wrap a join and aggregation behind one name

CREATE VIEW dbo.CustomerOrderSummary
AS
SELECT c.customer_id,
       c.name,
       COUNT(o.order_id) AS order_count,
       SUM(o.total)      AS lifetime_value
FROM Customers AS c
LEFT JOIN Orders AS o
    ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.name;

The view stores only this definition, not the rows; the query runs against the base tables each time it is used.

Query the view like a table

SELECT name, lifetime_value
FROM dbo.CustomerOrderSummary
WHERE order_count > 0
ORDER BY lifetime_value DESC;

Callers treat the view as a virtual table and never repeat the join or GROUP BY.

Follow-up questions to expect

When would you use an indexed or materialized view instead of a standard view?
Can you update data through a view, and what conditions must the view meet?
What is the difference between a view and a CTE?
How does WITH SCHEMABINDING affect a view?

8 What is an index and how does it speed up queries?

An index is a separate, sorted data structure that the database keeps alongside a table so it can find rows quickly. The classic analogy is the index at the back of a book: instead of reading every page to find a topic, you look it up in the sorted index and jump straight to the right page. Without an index the engine must do a table scan, reading every row to check the filter.

Most indexes are stored as a B-tree (balanced tree). The keys are held in sorted order across a shallow tree of pages, so the engine starts at the root and follows a few pointers down to the leaf that holds the value it wants. Because the tree stays balanced and shallow, a seek touches only a handful of pages even in a table with millions of rows, turning a linear scan into something close to a logarithmic lookup. See how SQL indexes work for a deeper walk through.

SQL Server has two broad kinds. A clustered index defines the physical order of the table itself, so the table data is the leaf level of the tree; a table can have only one. A non-clustered index is a separate structure whose leaves hold the key plus a pointer back to the row, and a table can have many. A non-clustered index that lacks a needed column may still trigger a lookup back into the table, which is why INCLUDEd columns and covering indexes matter.

Indexes are a trade-off. Reads that filter or sort on the indexed columns get much faster, but every INSERT, UPDATE and DELETE now has to maintain the index too, so writes are slower and each index costs extra storage. The rule of thumb is to index the columns you actually query on and no more.

A selective WHERE (an equality or range that returns a small fraction of rows) benefits most.
JOIN columns benefit, because the engine can seek into the other table instead of scanning it.
ORDER BY and GROUP BY can skip a sort when the data is already in index order.
Low-selectivity filters (for example a flag with two values) often gain nothing, because a scan is cheaper than many seeks.

For more on measuring the payoff, see how SQL indexes improve performance.

Create a non-clustered index on a frequently filtered column

CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON Orders (customer_id)
    INCLUDE (order_date, total);

The key lets the engine seek by customer_id; INCLUDE stores order_date and total in the leaf so the query is covered without a lookup back to the table.

A query that benefits from the index

SELECT order_date, total
FROM Orders
WHERE customer_id = 4271
ORDER BY order_date;

The selective WHERE lets the engine seek straight to one customer, and the index order supplies the ORDER BY for free, so no table scan or sort is needed.

Follow-up questions to expect

What is the difference between a clustered and a non-clustered index?
Why can adding too many indexes hurt performance?
What is a covering index and when would you use INCLUDE columns?
How would you tell whether a query is using an index seek or a table scan?

9 What is the difference between a clustered and a non-clustered index?

Both are B-tree structures that speed up lookups, but they differ in what sits at the leaf level. A clustered index is not a copy of the table sitting beside it; it is the table, physically stored and sorted by the index key. Because a table can only be laid out in one physical order, you get at most one clustered index per table. Its leaf pages hold the full data rows in key order.

A non-clustered index is a separate structure that stores the index key columns in sorted order, and each leaf entry carries a row locator that points back to the underlying row. If the table has a clustered index, that locator is the clustered key; if the table is a heap (no clustered index), it is a physical RID (file, page, slot). You can create many non-clustered indexes on one table. See the SQL indexes guide for a fuller walk-through.

Clustered: one per table, leaf level = the data rows, table is physically sorted by the key.
Non-clustered: many per table, leaf level = key columns plus a locator, stored separately from the data.
In SQL Server a PRIMARY KEY creates a clustered index by default (unless you ask for NONCLUSTERED), and a UNIQUE constraint defaults to non-clustered.

The practical cost of a non-clustered index shows up as the key lookup (or RID lookup): when a query filters using the index but also selects columns the index does not contain, the engine must follow the locator back to the clustered index or heap to fetch the missing columns. Each lookup is a separate read, so this gets expensive over many rows.

You avoid lookups with included columns. Adding INCLUDE (col1, col2) stores extra non-key columns at the leaf level of the non-clustered index, so the index can satisfy the query on its own. When an index contains every column a query needs, it becomes a covering index and the lookup disappears.

Clustered index: physically orders the table by the key

CREATE CLUSTERED INDEX IX_Orders_OrderDate
    ON Orders (order_date);

The rows of Orders are now stored sorted by order_date; there can be only one such index per table.

Non-clustered index with included columns to avoid key lookups

CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON Orders (customer_id)
    INCLUDE (order_date, total);

The key is customer_id; INCLUDE stores order_date and total at the leaf so a query selecting those columns needs no lookup.

Follow-up questions to expect

What is a key lookup, and how do included columns eliminate it?
Why can a table have only one clustered index but many non-clustered ones?
What is a heap, and how does a non-clustered index locate rows on it?
How does the choice of clustered key affect insert performance and fragmentation?

10 What is a transaction and what does ACID mean?

A transaction is a group of one or more statements treated as a single, indivisible unit of work. Either every statement succeeds and the changes are made permanent with COMMIT, or something fails and every change is undone with ROLLBACK. The classic example is transferring money between two accounts: debiting one and crediting the other must both happen, or neither should.

The guarantees a transaction provides are summed up by the acronym ACID:

Atomicity: all statements in the transaction complete as one unit, or none of them do.
Consistency: the transaction moves the database from one valid state to another, keeping every constraint and rule intact.
Isolation: concurrent transactions do not see each other's uncommitted changes, controlled by the isolation level.
Durability: once a transaction commits, its changes survive a crash or power loss because they are written to durable storage.

In T-SQL you open a transaction with BEGIN TRAN, confirm it with COMMIT, or undo it with ROLLBACK. In production code you normally wrap the work in a TRY...CATCH block so that any error rolls the whole thing back instead of leaving a partial write. See transactions, locks and deadlocks for how this interacts with concurrency.

For a deeper walkthrough with examples, read SQL transactions and ACID explained. Getting ACID right is the difference between a database you can trust with money and one you cannot.

Explicit transaction: transfer money between accounts

BEGIN TRAN;

UPDATE Accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE Accounts SET balance = balance + 100 WHERE account_id = 2;

COMMIT;

Both updates commit together. If you issued ROLLBACK instead of COMMIT, neither update would take effect.

TRY CATCH with ROLLBACK for safe error handling

BEGIN TRY
    BEGIN TRAN;

    UPDATE Accounts SET balance = balance - 100 WHERE account_id = 1;
    UPDATE Accounts SET balance = balance + 100 WHERE account_id = 2;

    COMMIT;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK;
    THROW;
END CATCH;

If either statement errors, control jumps to CATCH, the transaction is rolled back, and THROW re-raises the error to the caller.

Follow-up questions to expect

What does the isolation level control, and what are the standard levels?
What is the difference between COMMIT and ROLLBACK, and what does @@TRANCOUNT do?
How do nested transactions and savepoints behave in SQL Server?
What happens to a transaction if the connection is lost before it commits?

11 What is the difference between EXISTS and IN?

Both IN and EXISTS let you filter a query using the result of a subquery, but they ask different questions. IN asks "is this value equal to any value in this list?" and works over a set of scalar values. EXISTS asks "does this correlated subquery return at least one row?" and only cares about presence, not about the values themselves.

Because EXISTS only checks for the existence of a row, it short-circuits: the moment the subquery finds one matching row it stops and returns true. That is why an EXISTS subquery often uses SELECT 1 instead of selecting real columns. IN instead materializes the full list of values before comparing, which is why it pairs naturally with a subquery that returns a single column.

The most important difference in an interview is how they handle NULLs. A NOT IN list that contains even one NULL can silently return no rows at all, because SQL cannot prove a value is "not equal to" an unknown. NOT EXISTS does not have this problem: it evaluates to true or false per row and treats a non-matching correlated subquery as simply "no row found".

IN: compares one value to a list or single-column subquery; the subquery is independent (not correlated).
EXISTS: takes a correlated subquery and returns true as soon as one row is found.
NOT IN + a NULL in the list = zero rows returned, a classic and hard-to-spot bug.
NOT EXISTS is the safe way to write an anti-join.

On performance the two are usually similar on modern optimizers, which often rewrite one form into the other or into a semi join. The practical rule is: reach for EXISTS / NOT EXISTS when the subquery is correlated or when negation and NULLs are in play, and use IN for a short static list or a simple uncorrelated subquery. See the EXISTS documentation on Microsoft Learn for the exact semantics.

EXISTS: customers that have placed at least one order

SELECT c.name
FROM Customers AS c
WHERE EXISTS (
    SELECT 1
    FROM Orders AS o
    WHERE o.customer_id = c.customer_id
);

The correlated subquery references c.customer_id and returns true the instant one matching order is found.

The NOT IN NULL trap versus NOT EXISTS

-- If any returned customer_id is NULL, this returns ZERO rows:
SELECT c.name
FROM Customers AS c
WHERE c.customer_id NOT IN (
    SELECT o.customer_id FROM Orders AS o
);

-- Safe rewrite: finds customers with no orders, NULLs and all:
SELECT c.name
FROM Customers AS c
WHERE NOT EXISTS (
    SELECT 1
    FROM Orders AS o
    WHERE o.customer_id = c.customer_id
);

A single NULL in the NOT IN list makes every comparison unknown, so no row qualifies. NOT EXISTS is unaffected.

Follow-up questions to expect

Why can NOT IN return no rows when the subquery contains a NULL?
How does the optimizer treat IN and EXISTS differently, if at all?
When would you use IN with a static list instead of a join?
How would you rewrite a NOT EXISTS anti-join as a LEFT JOIN with IS NULL?

12 What is the difference between COALESCE and ISNULL?

COALESCE and ISNULL both replace a NULL with a fallback value, but they are not the same function. COALESCE is part of the ANSI SQL standard and works across most database engines, while ISNULL is proprietary to SQL Server. You can read the full reference on the COALESCE function page.

Number of arguments. COALESCE(a, b, c, ...) accepts two or more arguments and returns the first one that is not NULL. ISNULL(check, replacement) takes exactly two arguments.
Return data type. COALESCE uses data type precedence rules across all of its arguments (it is essentially shorthand for a CASE expression). ISNULL returns the data type of the first argument, which can silently truncate the replacement value.
NULLability of the result. A computed column built with ISNULL is often marked NOT NULL, whereas one built with COALESCE is usually treated as NULLable. This matters when a SELECT INTO creates a new table from the expression.

A subtle trap: because COALESCE expands to a CASE expression, an argument that contains a subquery can be evaluated more than once. If that subquery is expensive or non-deterministic, COALESCE may run it twice, while ISNULL evaluates each argument once. For simple two argument NULL replacement inside SQL Server, ISNULL is fine; when you need portability, more than two fallbacks, or standard behaviour, prefer COALESCE.

See the official Microsoft Learn COALESCE reference for the full precedence rules.

COALESCE: first non-NULL across several columns

SELECT
    emp_id,
    COALESCE(mobile_phone, home_phone, work_phone, 'no phone on file') AS contact
FROM Employees;

COALESCE scans the arguments left to right and returns the first one that is not NULL, so it can chain many fallbacks in a single expression.

ISNULL: two argument replacement (SQL Server only)

SELECT
    order_id,
    ISNULL(discount, 0) AS discount
FROM Orders;

ISNULL takes exactly two arguments and returns the data type of the first, so a longer replacement string could be truncated to that column length.

Follow-up questions to expect

Why can COALESCE evaluate a subquery more than once?
How does the return data type of ISNULL differ from COALESCE?
Is the result of COALESCE NULLable, and why does that matter for SELECT INTO?
Which of the two would you use for cross database portability?

13 What is the CASE expression and how do you use it?

The CASE expression is SQL's way of writing if/then/else logic inline. It evaluates a list of conditions in order and returns the value from the first branch that is true, which makes it the go-to tool for deriving a column, bucketing values, or reshaping data without leaving the query.

There are two forms. A searched CASE writes a full boolean test after each WHEN (for example WHEN price > 100), so it can compare ranges, use AND/OR, and test different columns. A simple CASE names an expression once after CASE and then lists values to match (CASE status WHEN 1 THEN ...); it only checks equality against that single expression. The searched form is more flexible and is what you will use most often.

Because CASE is an expression, not a statement, it can appear almost anywhere a value is allowed: in the SELECT list, in WHERE, in ORDER BY for custom sort orders, in GROUP BY, and inside aggregate functions. If none of the WHEN branches match and there is no ELSE, the result is NULL, so add an ELSE when you want a defined default.

Evaluated top to bottom: the first matching WHEN wins, so order your conditions from most specific to most general.
Single return type: every branch must return a compatible data type, since CASE yields one column.
Conditional aggregation: wrapping CASE in an aggregate, like SUM(CASE WHEN region = 'East' THEN 1 ELSE 0 END), lets you count or total only the rows that match a condition. This trick is how you pivot rows into columns and build cross-tab reports in one pass.

CASE also underpins several everyday patterns: it is the manual equivalent of COALESCE for NULL handling, and understanding it is the foundation for pivoting rows to columns. See the CASE function reference for the full syntax and more examples.

Searched CASE: bucket customers by order total

SELECT
    customer_id,
    total,
    CASE
        WHEN total >= 1000 THEN 'VIP'
        WHEN total >= 100  THEN 'Regular'
        ELSE 'New'
    END AS tier
FROM Orders
ORDER BY total DESC;

Conditions are checked top to bottom; the first true WHEN wins, and ELSE catches everything else.

Conditional aggregation: pivot order counts by status

SELECT
    customer_id,
    SUM(CASE WHEN status = 'Shipped'   THEN 1 ELSE 0 END) AS shipped,
    SUM(CASE WHEN status = 'Pending'   THEN 1 ELSE 0 END) AS pending,
    SUM(CASE WHEN status = 'Cancelled' THEN 1 ELSE 0 END) AS cancelled
FROM Orders
GROUP BY customer_id;

Wrapping CASE in SUM turns rows into columns, counting only the rows that match each condition in a single pass.

Follow-up questions to expect

What is the difference between a searched CASE and a simple CASE?
What does CASE return when no WHEN matches and there is no ELSE?
How would you use CASE to pivot rows into columns?
How does CASE relate to COALESCE and ISNULL for handling NULLs?

14 How do you find duplicate rows in a table?

Start by deciding what "duplicate" means. Two rows are duplicates when the columns you care about hold the same values, even if other columns (like a surrogate id) differ. Once you know those key columns, finding duplicates is a counting problem: group by the key columns and keep only the groups that appear more than once.

The classic approach is GROUP BY with HAVING. You group on the key columns and filter with HAVING COUNT(*) > 1. Remember that HAVING filters groups after aggregation, whereas WHERE filters individual rows before grouping, so the count test has to live in HAVING. This tells you which value combinations are duplicated and how many times.

The limitation is that GROUP BY collapses each group into a single output row, so you see the duplicated values but not the individual offending rows. When you need the actual rows (for example to inspect or delete them), use ROW_NUMBER() with PARTITION BY on the key columns. Each partition is numbered from 1, so any row with rn > 1 is a repeat of one you have already seen.

GROUP BY / HAVING answers "which values are duplicated and how often" in one compact result.
ROW_NUMBER() OVER (PARTITION BY ...) tags every physical row so you can keep the first and flag or delete the rest.
Use PARTITION BY for the columns that define a duplicate and ORDER BY for the tie-breaker that decides which copy to keep.

Once you have identified them, removing duplicates is usually done with the same ROW_NUMBER() query wrapped in a CTE and a DELETE. See the walkthrough on how to delete duplicate rows for the safe pattern that keeps exactly one copy of each.

GROUP BY / HAVING: which values are duplicated

SELECT email, COUNT(*) AS copies
FROM Customers
GROUP BY email
HAVING COUNT(*) > 1
ORDER BY copies DESC;

Groups rows by email and returns only the values that occur more than once, with a count of how many times.

ROW_NUMBER(): flag each individual duplicate row

WITH Ranked AS (
    SELECT
        customer_id,
        email,
        ROW_NUMBER() OVER (
            PARTITION BY email
            ORDER BY customer_id
        ) AS rn
    FROM Customers
)
SELECT customer_id, email, rn
FROM Ranked
WHERE rn > 1;

PARTITION BY email restarts the counter for each email; rows with rn > 1 are the extra copies. Swap the final SELECT for DELETE to remove them.

Follow-up questions to expect

How would you delete the duplicate rows while keeping exactly one copy?
Why must the COUNT(*) > 1 test go in HAVING and not WHERE?
What does PARTITION BY do differently from GROUP BY?
How would you find duplicates based on several columns rather than one?

15 What is an anti join and how do you find rows with no match?

An anti join is the opposite of an inner join. Where an inner join keeps only rows that match, an anti join keeps only the rows in one table that have no matching row in the other. The classic interview phrasing is "find the customers who have never placed an order" or "list products that were never sold". SQL has no dedicated ANTI JOIN keyword, so you express the pattern with one of three idioms.

1. LEFT JOIN ... WHERE right.key IS NULL. Do a LEFT JOIN so every left row survives, then filter to the rows where the right side came back NULL. Those NULLs are exactly the rows that found no match. It is readable and works everywhere, but you must filter on a right column that is never NULL in real data, such as its primary key.

2. NOT EXISTS (preferred). A correlated NOT EXISTS subquery is usually the best choice: it reads as "keep this row when no matching row exists", the optimizer turns it into an efficient anti semi join, and it is NULL-safe. EXISTS only cares whether a row is returned, so NULLs in the joined column do not corrupt the result.

3. NOT IN (use with care). NOT IN (subquery) looks clean but has a notorious trap: if the subquery returns even one NULL, the whole NOT IN evaluates to unknown and the query returns zero rows. This is because x NOT IN (1, NULL) can never be proven true. Only use NOT IN when the column is guaranteed NOT NULL, otherwise reach for NOT EXISTS.

Which to prefer: default to NOT EXISTS for correctness and performance. Use LEFT JOIN ... IS NULL when it reads more naturally to your team or you already have the join. Avoid NOT IN on nullable columns. See also EXISTS vs IN and INNER vs LEFT JOIN.

Anti join with LEFT JOIN and IS NULL

SELECT c.customer_id, c.name
FROM Customers AS c
LEFT JOIN Orders AS o
    ON o.customer_id = c.customer_id
WHERE o.customer_id IS NULL;

Every customer survives the LEFT JOIN; keeping only the NULL right-side rows leaves customers with no order.

Same result with NOT EXISTS (preferred, NULL-safe)

SELECT c.customer_id, c.name
FROM Customers AS c
WHERE NOT EXISTS (
    SELECT 1
    FROM Orders AS o
    WHERE o.customer_id = c.customer_id
);

NOT EXISTS returns a customer only when no matching order exists, and NULLs in Orders cannot break it.

Follow-up questions to expect

Why can NOT IN return zero rows when the subquery contains a NULL?
How does the optimizer execute NOT EXISTS differently from NOT IN?
How would you rewrite a LEFT JOIN ... IS NULL anti join as an EXCEPT?
What index would make an anti join between Customers and Orders fast?

Keep going

Explore the other interview categories, or drill the fundamentals with the functions library and the SQL Server error library.

Get 1-on-1 Prep All Categories