Home › SQL Interview Questions › Advanced

SQL Interview Prep

Advanced SQL Interview Questions

Q: What is a recursive CTE and when do you need one?

A recursive CTE is a common table expression that references itself, built from an anchor member unioned with a recursive member that repeats until it returns no new rows. You need one to walk hierarchies such as an org chart or bill of materials, or to generate a sequence or date range.

Q: What is a window frame, and what is the difference between ROWS and RANGE?

A window frame defines which rows inside the partition feed a windowed aggregate. ROWS counts physical rows, while RANGE groups peer rows with equal ORDER BY values so tied rows are always included together.

Q: How do you calculate a running total with a window function?

Use SUM(amount) OVER (ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW). The ORDER BY defines the accumulation sequence, and the ROWS frame sums every row from the start up to the current one. Add PARTITION BY to restart the total per group.

Q: How do PIVOT and UNPIVOT work?

PIVOT rotates row values into columns by applying an aggregate to a value column, turning a tall result into a wide one. UNPIVOT does the reverse, folding several columns back into rows. The portable equivalent of PIVOT is conditional aggregation with SUM(CASE WHEN ...).

Q: What are the transaction isolation levels and what anomalies do they prevent?

The four standard levels are READ UNCOMMITTED, READ COMMITTED (the SQL Server default), REPEATABLE READ and SERIALIZABLE. Each higher level prevents more read anomalies: dirty reads, non-repeatable reads and phantoms, in exchange for reduced concurrency.

Q: What causes a deadlock and how do you prevent one?

A deadlock happens when two transactions each hold a lock the other needs, forming a cycle so neither can proceed. SQL Server picks a victim and rolls it back with error 1205. You prevent them by accessing objects in a consistent order, keeping transactions short, indexing well, and adding retry logic.

Q: What is a covering index and how does an included column help?

A covering index contains every column a query needs, either in the key or as INCLUDE columns, so the engine answers the query from the index alone and skips the key lookup back to the table. INCLUDE columns live only at the leaf level and do not affect key order.

Q: What makes a predicate sargable, and why does it matter?

A predicate is sargable (Search ARGument ABLE) when the engine can satisfy it with an index seek because the indexed column is compared directly, not wrapped in a function or conversion. Sargable predicates let the optimizer seek to the rows it needs; non-sargable ones force a full scan of every row.

Q: What is the difference between a temp table and a table variable?

A temp table (#t) lives in tempdb, carries column statistics, supports full indexing and participates in transactions, so it suits larger result sets. A table variable (@t) has no column statistics, is not rolled back, and is best for small sets.

Q: How does the MERGE statement work for an upsert?

MERGE compares a source set to a target table on a join key and, in one statement, runs INSERT, UPDATE or DELETE depending on whether each row matched: an upsert. Because MERGE has a history of bugs and concurrency traps, many teams prefer a separate UPDATE then INSERT guarded by locking hints.

Advanced questions probe depth: recursive CTEs, window frame clauses, isolation levels, indexing internals and query rewriting. These are the questions that distinguish senior candidates. Every answer pairs a diagram with concrete SQL.

Advanced 15 questions

1 What is a recursive CTE and when do you need one?

A recursive CTE is a common table expression that refers to itself inside its own definition. It lets a single query loop over rows that point at other rows in the same table, which is something a plain JOIN cannot do when the depth is unknown ahead of time.

Every recursive CTE has the same two part structure joined by UNION ALL:

The anchor member is the base query. It runs once and produces the starting rows (for example the top manager, or the first number in a series).
The recursive member references the CTE by name and joins back to the previous result. It runs again and again, each pass feeding on the rows the last pass produced.
Recursion stops on its own when the recursive member returns no new rows, so you must include a termination condition to avoid an infinite loop.

The classic use is walking a hierarchy stored as a self referencing table: an org chart where each employee row holds a manager_id, or a bill of materials where each part is assembled from sub parts. Because the number of levels is not known in advance, you cannot write a fixed set of joins; the recursion follows the chain to whatever depth it reaches. Recursive CTEs are also handy to generate sequences, such as a run of integers or a continuous date range for a calendar or report with no gaps.

In SQL Server the recursion depth is capped at 100 passes by default. If a legitimate hierarchy is deeper, or you build a longer series, raise the limit with the OPTION (MAXRECURSION n) query hint, where n is 0 to 32767. Setting MAXRECURSION 0 removes the cap entirely, so use it only when you are certain the query terminates. See the Microsoft Learn docs on CTEs for the full syntax.

Walk an org chart from the top manager down

WITH OrgChart AS (
    -- Anchor member: employees with no manager (the top)
    SELECT emp_id, name, manager_id, 0 AS level
    FROM Employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive member: attach each report to its manager
    SELECT e.emp_id, e.name, e.manager_id, oc.level + 1
    FROM Employees AS e
    INNER JOIN OrgChart AS oc
        ON e.manager_id = oc.emp_id
)
SELECT emp_id, name, manager_id, level
FROM OrgChart
ORDER BY level, name;

The anchor picks the top manager; the recursive member follows manager_id down one level per pass until no rows report to the last set.

Generate a continuous date series with MAXRECURSION

WITH Dates AS (
    -- Anchor member: the first date
    SELECT CAST('2026-01-01' AS date) AS d

    UNION ALL

    -- Recursive member: add one day, stop at the end date
    SELECT DATEADD(DAY, 1, d)
    FROM Dates
    WHERE d < '2026-12-31'
)
SELECT d
FROM Dates
ORDER BY d
OPTION (MAXRECURSION 366);

The WHERE clause is the termination condition; MAXRECURSION 366 lifts the default 100 pass cap so a full year of dates fits.

Follow-up questions to expect

What is the difference between the anchor member and the recursive member?
What happens if a recursive CTE has no termination condition?
What does OPTION (MAXRECURSION 0) do, and why is it risky?
How would you detect or prevent an infinite loop caused by a cycle in the data?

2 What is a window frame, and what is the difference between ROWS and RANGE?

A window frame is the slice of rows within the current partition that a window function actually reads to produce its value for the current row. You write it inside the OVER clause after ORDER BY, for example ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. The frame slides as the window moves from row to row, which is what turns SUM into a running total instead of a single grand total.

A frame has two parts: a frame unit (ROWS or RANGE) and a set of bounds such as UNBOUNDED PRECEDING, CURRENT ROW, or N PRECEDING. The unit decides how those bounds are measured.

ROWS counts physical rows. CURRENT ROW means exactly this one row, so the frame stops at the row you are on regardless of its value.
RANGE works on the ORDER BY value, not the row position. Every row that shares the same ORDER BY value is a peer, and CURRENT ROW under RANGE extends the frame to include all peers. That means tied rows are always pulled in together.

The common surprise is the default frame. When you supply ORDER BY in the OVER clause but omit an explicit frame, the engine applies RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. If your ORDER BY column has duplicate values, a running total then jumps in blocks because every tied row receives the same cumulative sum. Switching to ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW makes the total advance one row at a time.

Rule of thumb: use ROWS when you want a precise, row-by-row running total or moving window, and reach for RANGE only when you deliberately want tied rows treated as a single group. See the OVER clause documentation for the full frame grammar.

ROWS: a precise running total, one row at a time

SELECT order_date, amount,
       SUM(amount) OVER (
           ORDER BY order_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total
FROM Sales
ORDER BY order_date;

Two orders on the same date get different running totals, because ROWS counts each physical row separately.

RANGE (the default): tied dates share one cumulative value

SELECT order_date, amount,
       SUM(amount) OVER (
           ORDER BY order_date
           RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total
FROM Sales
ORDER BY order_date;

Omitting the frame gives this same RANGE behaviour: every row sharing a date is a peer, so tied dates all show the identical running total.

Follow-up questions to expect

What frame does SQL Server apply if you use ORDER BY in OVER but write no ROWS or RANGE clause?
How would you compute a 3 row moving average with a window frame?
Why can a running total using the default frame appear to skip ahead in blocks?
Can you use a window frame with RANK or ROW_NUMBER, and why or why not?

3 How do you calculate a running total with a window function?

A running total (or cumulative sum) adds each row to the sum of all rows before it. The modern way is a windowed SUM() aggregate: SUM(amount) OVER (ORDER BY order_date ...). The OVER clause turns the aggregate into a window function, so it returns a value for every row instead of collapsing the result into one grand total.

Two parts of the OVER clause control the result. The ORDER BY sets the sequence in which rows accumulate, and the frame (ROWS BETWEEN ...) defines the window of rows that feed into each running value. For a classic running total the frame is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: everything from the first row up to and including the current one.

Always specify ROWS explicitly. If you write ORDER BY order_date with no frame, the SQL standard default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. RANGE groups by value, so any rows that tie on order_date all receive the same total (the sum through the last tied row) instead of a true row by row running total. Writing ROWS avoids that surprise and is usually faster.

To keep a separate running total per group - say per customer - add PARTITION BY customer_id. The sum then resets at each new partition, and the ORDER BY orders rows within it. This is the pattern interviewers are looking for when they ask for a per-customer or per-account cumulative balance. See the SQL window functions guide for a deeper walkthrough of frames and partitions.

Before window functions existed, people computed running totals with a correlated subquery such as (SELECT SUM(x) FROM T t2 WHERE t2.d <= t1.d). That re-scans the table once per row (O(n^2) work) and is awkward to read. The windowed SUM does it in a single ordered pass, so it is both clearer and dramatically faster on large tables.

Overall running total ordered by date

SELECT
    order_date,
    amount,
    SUM(amount) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM Orders
ORDER BY order_date;

Each row shows the cumulative sum of amount from the first row through the current one. The explicit ROWS frame avoids the RANGE tie default.

Per-customer running total with PARTITION BY

SELECT
    customer_id,
    order_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS customer_running_total
FROM Orders
ORDER BY customer_id, order_date;

PARTITION BY customer_id restarts the total for each customer, so you get an independent cumulative balance per account.

Follow-up questions to expect

What is the difference between the ROWS and RANGE frame clauses?
Why can a windowed SUM be much faster than a correlated subquery for a running total?
How would you compute a moving average over the last 3 rows instead of a running total?
What happens to the running total when two rows have the same value in the ORDER BY column?

4 How do PIVOT and UNPIVOT work?

The PIVOT operator rotates the distinct values of one column into a set of output columns, applying an aggregate to a value column for each new column. You take a tall, narrow result (one row per category) and turn it into a wide, short one (one column per category). It is the classic way to build a cross tab or report grid, for example turning a list of monthly sales into one column per month.

Every PIVOT has three moving parts: the aggregate and the value column it runs on (such as SUM(amount)), the spreading column whose distinct values become new column names (listed in the FOR ... IN (...) clause), and the grouping columns, which are simply every other column you keep in the source query. SQL Server groups by those remaining columns implicitly, so you control the grouping by choosing exactly which columns feed the PIVOT.

UNPIVOT does the reverse: it takes several columns and folds them into two columns, one holding the former column name and one holding its value. It is useful for normalizing a wide spreadsheet style table back into rows so you can filter, join, or aggregate over it. Note that UNPIVOT drops NULLs, so it is not a perfect inverse of PIVOT if the pivoted grid had empty cells.

The portable alternative to PIVOT is conditional aggregation: SUM(CASE WHEN col = value THEN metric END) as one expression per output column, wrapped in a normal GROUP BY. This works on every database, reads clearly, and lets you mix several aggregates in one pass. Many engineers prefer it to the PIVOT syntax for exactly those reasons. See the CASE expression and SUM for the building blocks.

The big limitation is that both PIVOT and CASE require you to list the target column values in advance. When those values are not known until run time (for example an unknown set of months or product codes), you need dynamic PIVOT: query the distinct values, build the column list into a string, and execute it with dynamic SQL via sp_executesql.

PIVOT: total sales per quarter, one column per quarter

SELECT product, [Q1], [Q2], [Q3], [Q4]
FROM (
    SELECT product, quarter, amount
    FROM Sales
) AS src
PIVOT (
    SUM(amount)
    FOR quarter IN ([Q1], [Q2], [Q3], [Q4])
) AS pvt
ORDER BY product;

The distinct quarter values become columns; SUM(amount) fills each cell. product is the implicit grouping column because it is the only other column in src.

Equivalent conditional aggregation (portable, no PIVOT keyword)

SELECT product,
       SUM(CASE WHEN quarter = 'Q1' THEN amount END) AS [Q1],
       SUM(CASE WHEN quarter = 'Q2' THEN amount END) AS [Q2],
       SUM(CASE WHEN quarter = 'Q3' THEN amount END) AS [Q3],
       SUM(CASE WHEN quarter = 'Q4' THEN amount END) AS [Q4]
FROM Sales
GROUP BY product
ORDER BY product;

Same result as the PIVOT, but runs on any database. Each CASE selects the rows for one quarter and SUM ignores the NULLs the CASE returns for other quarters.

Follow-up questions to expect

How would you write a dynamic PIVOT when the column values are not known in advance?
Why does UNPIVOT drop NULL values, and how can that bite you?
When would you prefer SUM(CASE WHEN ...) over the PIVOT operator?
Can you produce more than one aggregate (for example SUM and COUNT) in a single PIVOT?

5 What are the transaction isolation levels and what anomalies do they prevent?

An isolation level controls how much one transaction is affected by the concurrent work of other transactions. It is the I in ACID. The SQL standard defines four levels, and each one is defined by which of three read anomalies it permits.

The three anomalies, from least to most subtle, are:

Dirty read: you read a row another transaction has changed but not yet committed. If that transaction rolls back, you read data that never truly existed.
Non-repeatable read: you read the same row twice in one transaction and get different values because another transaction updated and committed it in between.
Phantom read: you run the same range query twice and the second run returns extra rows because another transaction inserted rows that match your predicate.

The four standard levels build on each other:

READ UNCOMMITTED allows all three anomalies. Readers take no shared locks and can see uncommitted data. The NOLOCK hint is the same thing applied to one table.
READ COMMITTED is the default in SQL Server. It prevents dirty reads but still allows non-repeatable reads and phantoms.
REPEATABLE READ also prevents non-repeatable reads by holding shared locks on rows you have read until the transaction ends, but phantoms are still possible.
SERIALIZABLE prevents all three by taking range locks, so the result is as if transactions ran one at a time.

SQL Server also offers row versioning so readers do not block writers. SNAPSHOT isolation gives each transaction a transaction-consistent view from the version store in tempdb, avoiding all three anomalies without holding read locks. READ_COMMITTED_SNAPSHOT is a database option that makes the ordinary READ COMMITTED level use row versioning at the statement level, which removes most reader/writer blocking with almost no code change.

The core trade-off is consistency versus concurrency: stricter levels give you more correctness guarantees but hold locks longer, which lowers throughput and raises the chance of blocking and deadlocks (error 1205). See transactions, locks and deadlocks for how the locking behaves in practice. Microsoft documents the details in the SET TRANSACTION ISOLATION LEVEL reference.

Set the isolation level for a transaction

SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

BEGIN TRANSACTION;
    SELECT balance FROM Accounts WHERE account_id = 1;
    -- other work here; the row read above cannot change
    SELECT balance FROM Accounts WHERE account_id = 1;
COMMIT TRANSACTION;

REPEATABLE READ holds a shared lock on the read row until commit, so the second SELECT returns the same value.

Enable row versioning (SNAPSHOT and READ_COMMITTED_SNAPSHOT)

ALTER DATABASE Sales SET ALLOW_SNAPSHOT_ISOLATION ON;
ALTER DATABASE Sales SET READ_COMMITTED_SNAPSHOT ON;

-- opt a single transaction into snapshot isolation
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;
    SELECT * FROM Orders WHERE status = 'open';
COMMIT TRANSACTION;

With versioning enabled readers get a consistent view from tempdb instead of blocking writers.

Follow-up questions to expect

Why is READ COMMITTED the default rather than SERIALIZABLE?
What is the difference between SNAPSHOT and READ_COMMITTED_SNAPSHOT isolation?
What overhead does row versioning add to tempdb, and how do you monitor it?
How can a higher isolation level increase the chance of a deadlock?

6 What causes a deadlock and how do you prevent one?

A deadlock occurs when two (or more) transactions each hold a lock that the other one needs, so they end up waiting on each other in a cycle that can never resolve on its own. Session A holds a lock on row 1 and wants row 2; session B holds a lock on row 2 and wants row 1. Neither will release what it has until it gets the thing the other is holding, so both would wait forever.

SQL Server runs a background deadlock monitor (the lock monitor thread) that periodically looks for these cycles in the wait-for graph. When it finds one it breaks the tie by choosing a deadlock victim, rolls that transaction back, releases its locks so the other can continue, and returns error 1205 to the losing session. By default the engine picks the victim with the least expensive transaction to roll back, but you can influence the choice with SET DEADLOCK_PRIORITY.

The key thing to say in an interview is that a deadlock is different from ordinary blocking. Blocking is one session simply waiting for another to release a lock, and it resolves as soon as the first transaction commits. A deadlock is a circular wait that cannot resolve itself, which is why the engine has to intervene and kill one side. For the wider picture of how locks and blocking lead up to this, see transactions, locks and deadlocks.

You prevent deadlocks by reducing either the chance of a cycle or the size of the lock footprint:

Access objects in a consistent order. If every transaction touches Accounts before Orders, two sessions cannot form the opposing cycle that causes a deadlock.
Keep transactions short. Do not hold a transaction open across user input, application round trips, or long computations. The shorter the lock is held, the smaller the window for a conflict.
Index the right columns. A query with no useful index scans and locks many rows; a seek on a good index locks just the rows it touches, shrinking the lock footprint and the chance of overlap. See SQL indexes.
Lower the isolation level or enable RCSI. Turning on READ_COMMITTED_SNAPSHOT (RCSI) lets readers use row versions instead of shared locks, so reads no longer block writers and vice versa, removing a large class of deadlocks.
Add retry logic. Because a deadlock victim is rolled back cleanly, error 1205 is safe to retry. Wrap the transaction in a loop that catches 1205 and re-runs it a few times.

To diagnose one you need the deadlock graph, which shows both processes, the resources they held and requested, and which one was chosen as the victim. The easiest way to capture it is Extended Events: the system_health session already records the xml_deadlock_report event, so recent deadlock graphs are usually there without any setup. You can also add a Trace Flag (1222) to log the graph to the error log, or build a dedicated Extended Events session.

Two sessions that deadlock by locking rows in opposite order

-- Session A
BEGIN TRAN;
UPDATE Accounts SET balance = balance - 100 WHERE id = 1;  -- locks row 1
-- (small delay, then it asks for row 2)
UPDATE Accounts SET balance = balance + 100 WHERE id = 2;  -- waits for row 2
COMMIT;

-- Session B (running at the same time)
BEGIN TRAN;
UPDATE Accounts SET balance = balance - 50 WHERE id = 2;   -- locks row 2
-- (small delay, then it asks for row 1)
UPDATE Accounts SET balance = balance + 50 WHERE id = 1;    -- waits for row 1
COMMIT;

Each session holds one row and asks for the other, forming a cycle. SQL Server rolls one back with error 1205. Having both update id = 1 before id = 2 would avoid it.

Retry logic that catches error 1205 and re-runs the transaction

DECLARE @attempt int = 0, @maxRetries int = 3;
WHILE @attempt < @maxRetries
BEGIN
    BEGIN TRY
        BEGIN TRAN;
            UPDATE Accounts SET balance = balance - 100 WHERE id = 1;
            UPDATE Accounts SET balance = balance + 100 WHERE id = 2;
        COMMIT TRAN;
        BREAK;  -- success, leave the loop
    END TRY
    BEGIN CATCH
        IF XACT_STATE() <> 0 ROLLBACK TRAN;
        IF ERROR_NUMBER() = 1205 AND @attempt < @maxRetries - 1
        BEGIN
            SET @attempt = @attempt + 1;
            WAITFOR DELAY '00:00:00.200';  -- brief backoff, then retry
        END
        ELSE
            THROW;  -- not a deadlock, or out of retries: re-raise
    END CATCH
END

Because the victim is rolled back cleanly, error 1205 is safe to retry. The loop backs off briefly and re-runs, and re-raises anything that is not a deadlock.

Follow-up questions to expect

How does a deadlock differ from ordinary blocking?
How does SQL Server decide which transaction becomes the deadlock victim?
How would you capture and read a deadlock graph in production?
How does enabling READ_COMMITTED_SNAPSHOT reduce deadlocks?

7 What is a covering index and how does an included column help?

A covering index is a nonclustered index that holds all of the columns a particular query references - those used for filtering, joining, ordering and in the SELECT list. Because every needed value is already in the index, the query is satisfied by reading the index alone and never has to go back to the base table. In an execution plan this shows up as an Index Seek with no accompanying Key Lookup (or RID Lookup on a heap).

Without covering, a nonclustered index stores only its key columns plus a pointer (the clustered key or a row id). If the query asks for extra columns, the engine must follow that pointer for each qualifying row to fetch the missing values. That extra step is the key lookup, and when many rows match it can dominate the cost or push the optimizer to ignore the index and scan the whole table instead.

The INCLUDE clause is how you cover a query cheaply. Included columns are stored only at the leaf level of the index, not in the intermediate (b-tree) levels, and they are not part of the key. That means they do not affect the sort order of the index and are not used for seeking - they are simply carried along so a lookup is unnecessary. This lets you keep a narrow, selective key while still covering the columns the query returns.

Put columns you search, join or order by in the index key, because key order determines what a seek can do.
Put columns you only return in the SELECT list in INCLUDE, since order does not matter for them.
INCLUDE also lets you cover data types that are not allowed as key columns and avoids bumping into the index key size limit.

The trade-off is width. Every column you add makes the index larger on disk and in memory, and every INSERT, UPDATE or DELETE that touches those columns must also maintain the index. So cover the queries that matter for performance tuning, not every column of the table.

Create a covering index with INCLUDE columns

CREATE NONCLUSTERED INDEX IX_Orders_Customer_Date
ON Orders (customer_id, order_date)
INCLUDE (total, status);

customer_id and order_date are the seekable key; total and status ride along at the leaf so the query does not need a key lookup.

A query that is now covered by that index

SELECT order_date, total, status
FROM Orders
WHERE customer_id = 42
ORDER BY order_date;

Every referenced column lives in the index, so the plan is a single Index Seek with no Key Lookup back to the table.

Follow-up questions to expect

What is the difference between a key column and an INCLUDE column in an index?
How can you tell from an execution plan that a query is not covered?
What is the downside of adding many INCLUDE columns to an index?
Why does column order in the index key matter for a seek?

8 What makes a predicate sargable, and why does it matter?

Sargable is a contraction of Search ARGument ABLE. A predicate is sargable when the query engine can use it as a search argument against an index, meaning it can perform an index seek that jumps straight to the qualifying rows instead of reading the whole table. Whether a WHERE (or JOIN or ON) predicate is sargable is one of the biggest levers you have over query cost, which is why it comes up in performance tuning interviews.

The rule of thumb: the indexed column must appear bare on one side of the comparison. The moment you wrap the column in a function or force a conversion, the engine can no longer use the index order, because the index stores the raw column values, not the computed result. Common ways to accidentally break sargability:

Wrapping the column in a function, for example YEAR(order_date) = 2024 or LEFT(name, 3) = 'abc'. The index is on order_date, not on YEAR(order_date).
A leading wildcard in LIKE, such as name LIKE '%son'. A trailing wildcard like 'son%' is still sargable because the prefix is fixed, but a leading % forces a scan.
Implicit conversion, for example comparing an NVARCHAR column to a numeric literal, or an NVARCHAR parameter against a VARCHAR column. SQL Server converts the column to match the literal and the seek is lost. See error 245 for the conversion failures this same mismatch can throw.
Arithmetic on the column, such as salary * 12 > 100000 instead of salary > 100000 / 12.

The fix is almost always to rewrite the predicate so the column stands alone and the transformation moves to the literal side. The classic example is turning a function over a date column into a half-open range: instead of YEAR(order_date) = 2024 you write order_date >= '2024-01-01' AND order_date < '2025-01-01'. This returns the same rows but lets the optimizer seek the range in the index.

Why it matters: a non-sargable predicate typically produces a clustered or table scan, so cost grows with total table size and stays high even when only a handful of rows match. A sargable rewrite turns that into a seek whose cost tracks the number of matching rows. On a large table that is the difference between milliseconds and seconds. Note that sargability only helps if a suitable index exists in the first place; making a predicate sargable does not create the index for you.

Non-sargable: function on the column forces a scan

-- YEAR() wraps the indexed column, so the index on order_date cannot be seeked
SELECT order_id, customer_id, total
FROM Orders
WHERE YEAR(order_date) = 2024;

The engine must compute YEAR() for every row, producing a full scan regardless of how many rows fall in 2024.

Sargable rewrite: a half-open date range enables a seek

-- The column is bare on the left, so an index on order_date can seek the range
SELECT order_id, customer_id, total
FROM Orders
WHERE order_date >= '2024-01-01'
  AND order_date <  '2025-01-01';

Same result set as the YEAR() version, but the optimizer seeks directly to the 2024 rows. The < upper bound avoids missing time components on Dec 31.

Follow-up questions to expect

Why is LIKE 'son%' sargable but LIKE '%son' is not?
How can an implicit type conversion silently turn a seek into a scan?
When would a non-sargable predicate still not matter for performance?
How does a computed column or a filtered index help with an otherwise non-sargable predicate?

9 What is the difference between a temp table and a table variable?

Both a temporary table and a table variable store an intermediate result set, and both physically live in tempdb. The difference that matters in an interview is how the optimizer treats them and how they behave inside a transaction, because that is what drives the plans you get for larger workloads.

A temp table is created with CREATE TABLE #t (or SELECT ... INTO #t). SQL Server maintains column statistics on it, so the optimizer can estimate row counts accurately. You can add clustered and nonclustered indexes, and it fully participates in transactions, which means a ROLLBACK undoes changes to it. That makes temp tables the right choice when the set is large enough that a bad row estimate would produce a bad plan.

A table variable is declared with DECLARE @t TABLE (...). It has no column statistics, so historically the optimizer estimated it as 1 row regardless of contents, which is fine for a handful of rows but disastrous for thousands. On SQL Server 2019 and later, table variable deferred compilation lets the optimizer see the real row count when the statement first compiles, which greatly improves estimates, though there are still no per-column histograms.

Statistics: temp table yes, table variable no.
Indexes: temp table full support (including after creation); table variable limited to a primary key or unique constraint declared inline.
Transactions: a temp table is affected by ROLLBACK; a table variable is not, so its rows survive a rolled-back transaction.
Scope: a local temp table is visible for the whole session and to nested stored procedures; a table variable is scoped to the batch, function or procedure that declares it.
Recompiles: temp tables can trigger statement recompiles as data changes; table variables avoid that overhead.

Rule of thumb: reach for a table variable for small sets (roughly a few hundred rows), inside functions, or when you want to avoid recompiles; reach for a temp table for larger sets, when you need indexes or statistics for a good plan, or when you want the intermediate data to honor transaction rollback. See also temp tables in stored procedures and performance tuning.

Temp table: create, index and use inside a transaction

CREATE TABLE #BigOrders (
    order_id  INT PRIMARY KEY,
    customer_id INT,
    total     DECIMAL(10,2)
);

INSERT INTO #BigOrders (order_id, customer_id, total)
SELECT order_id, customer_id, total
FROM Orders
WHERE total > 1000;

-- Add an index after load; the optimizer keeps statistics on it.
CREATE NONCLUSTERED INDEX IX_BigOrders_Customer
    ON #BigOrders (customer_id);

SELECT customer_id, COUNT(*) AS big_orders
FROM #BigOrders
GROUP BY customer_id;

DROP TABLE #BigOrders;

The temp table has statistics and a nonclustered index, so a large GROUP BY gets an accurate row estimate and a good plan.

Table variable: declare and populate for a small set

DECLARE @TopCustomers TABLE (
    customer_id INT PRIMARY KEY,
    total_spent DECIMAL(12,2)
);

INSERT INTO @TopCustomers (customer_id, total_spent)
SELECT TOP (20) customer_id, SUM(total)
FROM Orders
GROUP BY customer_id
ORDER BY SUM(total) DESC;

SELECT * FROM @TopCustomers ORDER BY total_spent DESC;

A handful of rows, no need for statistics or extra indexes; @t is scoped to this batch and is not touched by a ROLLBACK.

Follow-up questions to expect

Why did older versions of SQL Server estimate a table variable as one row, and what changed in 2019?
How does ROLLBACK behave differently for a temp table versus a table variable?
When would using a table variable cause a bad execution plan?
What is the difference between a local temp table and a global temp table?

10 How does the MERGE statement work for an upsert?

An upsert means "update the row if it already exists, otherwise insert it." The MERGE statement does this in a single pass: it joins a source (a table, view, subquery or table variable) to a target table on a key you specify, classifies every row as matched or not matched, and then fires one action per row. This lets you sync a table from a feed of changes without writing three separate statements.

The behavior is driven by up to three WHEN clauses:

WHEN MATCHED - the key exists in both source and target, so you usually UPDATE the target row (or DELETE it).
WHEN NOT MATCHED [BY TARGET] - the source row has no counterpart in the target, so you INSERT it. This is the "new row" case.
WHEN NOT MATCHED BY SOURCE - the target row has no counterpart in the source, so you can DELETE it (or update it, for example to flag it inactive). Use this one carefully: with an unfiltered source it can wipe rows you did not intend to touch.

MERGE looks elegant, but it carries real baggage. Over the years it has shipped with a long list of correctness and optimizer bugs, and Microsoft engineers like Aaron Bertrand have publicly cautioned against using it for anything important. Common traps include: it does not protect you from concurrent inserts by default, so two sessions can both decide a row is "not matched" and race to insert it (a primary key violation or duplicate); triggers, foreign keys and filtered indexes have historically interacted with it in surprising ways; and a single mistyped ON clause can update or delete the wrong rows. Because of this, many teams prefer an explicit UPDATE followed by an INSERT, which is easier to read, reason about and tune. See wrapping DML in a transaction for how to make the multi-statement version atomic.

For concurrency, the fix is to hold a range lock on the key while you decide. Adding the HOLDLOCK (equivalent to SERIALIZABLE) hint to the target - MERGE dbo.Target WITH (HOLDLOCK) - forces the statement to lock the key range it probed, so a competing session must wait instead of inserting a duplicate. The same idea applies to the UPDATE-then-INSERT pattern: take an UPDLOCK, HOLDLOCK when you test for existence, and run the whole thing inside a transaction so it either fully commits or rolls back. See the MERGE documentation on Microsoft Learn for the full syntax.

MERGE upsert: sync a target table from a staging source

MERGE dbo.Product AS tgt
USING dbo.ProductStaging AS src
    ON tgt.product_id = src.product_id
WHEN MATCHED AND (tgt.price <> src.price OR tgt.name <> src.name) THEN
    UPDATE SET tgt.name = src.name,
               tgt.price = src.price
WHEN NOT MATCHED BY TARGET THEN
    INSERT (product_id, name, price)
    VALUES (src.product_id, src.name, src.price)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;

One statement inserts new products, updates changed ones and deletes rows no longer in staging. The AND condition on WHEN MATCHED skips no-op updates.

The safer alternative: UPDATE then INSERT, atomic and locked

BEGIN TRAN;

UPDATE tgt
SET tgt.name = src.name,
    tgt.price = src.price
FROM dbo.Product AS tgt WITH (UPDLOCK, HOLDLOCK)
JOIN dbo.ProductStaging AS src
    ON tgt.product_id = src.product_id;

INSERT INTO dbo.Product (product_id, name, price)
SELECT src.product_id, src.name, src.price
FROM dbo.ProductStaging AS src
WHERE NOT EXISTS (
    SELECT 1 FROM dbo.Product AS tgt
    WHERE tgt.product_id = src.product_id
);

COMMIT;

UPDLOCK, HOLDLOCK takes a range lock so a concurrent session cannot insert the same key between the UPDATE and the INSERT. The transaction makes both steps commit or roll back together.

Follow-up questions to expect

Why do experienced DBAs often advise against using MERGE in production?
How does HOLDLOCK prevent a duplicate key race under concurrency?
What does WHEN NOT MATCHED BY SOURCE do, and why is it dangerous?
How would you rewrite an upsert without MERGE while keeping it atomic?

11 How do the LAG and LEAD window functions work?

LAG and LEAD are window functions that let a row read a value from another row in the same result set without a self join. LAG looks backward and returns a column value from a previous row, while LEAD looks forward and returns it from a following row.

Both are controlled by an OVER clause. The ORDER BY inside it defines what "previous" and "next" mean, so the ordering is required and matters. An optional PARTITION BY splits the data into groups and resets the sequence at each group boundary, so LAG on the first row of a partition has nothing to look back to.

The full signature is LAG(expr, offset, default). The offset is how many rows to step (default 1), and the default is the value returned when the step falls outside the partition (default NULL). Supplying a default is a clean way to avoid NULLs on the first row for LAG or the last row for LEAD.

Period-over-period change: subtract the prior row from the current row to get a day-over-day or month-over-month delta.
Detecting gaps: compare each date to the previous date to spot missing days or breaks in a sequence.
Comparing to the prior row: flag rows where a value increased, decreased, or stayed the same versus the one before it.

See the function references for exact syntax and edge cases: LAG and LEAD. Both are ordered analytic functions and cannot appear in a WHERE clause directly, so wrap the query in a CTE or subquery when you need to filter on their result.

LAG: day-over-day change in sales

SELECT
    sales_date,
    amount,
    LAG(amount, 1, 0) OVER (ORDER BY sales_date) AS prev_day,
    amount - LAG(amount, 1, 0) OVER (ORDER BY sales_date) AS day_change
FROM DailySales
ORDER BY sales_date;

LAG pulls the previous day amount; subtracting it gives the day-over-day difference. The default 0 avoids a NULL on the first day.

LEAD: next order date per customer to find gaps

SELECT
    customer_id,
    order_date,
    LEAD(order_date) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
    ) AS next_order_date
FROM Orders
ORDER BY customer_id, order_date;

PARTITION BY restarts the sequence per customer, so LEAD returns each customer next order date, NULL on their most recent one.

Follow-up questions to expect

What does LAG return for the first row of a partition, and how do you change that?
Why can you not use LAG or LEAD directly in a WHERE clause?
How would you rewrite a LAG query using a self join, and why is the window function usually better?
What is the difference between the ORDER BY inside OVER and the ORDER BY of the outer query?

12 What is CROSS APPLY and how is it different from a JOIN?

CROSS APPLY is a T-SQL operator that runs a table-valued expression (a subquery, a derived table, or a table-valued function) once for every row of the left input, and then joins the left row to whatever rows that expression produced. The key difference from a JOIN is correlation: the right side of APPLY can reference columns from the current left row, so it is re-evaluated per row.

A regular JOIN cannot do this. In a join, both inputs are logically produced on their own and then matched through the ON predicate; the right table does not get to see a specific left row while it is being built. That is exactly why you reach for APPLY when the right side is a function that takes the left row as an argument, or a subquery whose TOP or ORDER BY must depend on the left row.

APPLY comes in two forms, and the analogy to joins makes them easy to remember:

CROSS APPLY works like an INNER JOIN: if the right expression returns no rows for a given left row, that left row is dropped from the result.
OUTER APPLY works like a LEFT JOIN: the left row is kept even when the right expression returns nothing, with NULLs in the right side columns.

The two classic uses are top-N-per-group (for each parent row, grab its most recent or highest few child rows with a correlated TOP ... ORDER BY) and calling a table-valued function per row so it can be parameterized by that row. See the FROM clause documentation for the full syntax.

CROSS APPLY: the most recent order for each customer (top-1-per-group)

SELECT c.name, o.order_id, o.order_date, o.total
FROM Customers AS c
CROSS APPLY (
    SELECT TOP (1) o.order_id, o.order_date, o.total
    FROM Orders AS o
    WHERE o.customer_id = c.customer_id
    ORDER BY o.order_date DESC
) AS o;

The inner TOP (1) is correlated to c.customer_id and re-runs per customer. Customers with no orders are dropped, just like an INNER JOIN.

OUTER APPLY: keep customers with no orders

SELECT c.name, o.order_id, o.total
FROM Customers AS c
OUTER APPLY (
    SELECT TOP (1) o.order_id, o.total
    FROM Orders AS o
    WHERE o.customer_id = c.customer_id
    ORDER BY o.total DESC
) AS o;

Same correlated subquery, but a customer with zero orders still appears with NULLs, exactly like a LEFT JOIN.

Follow-up questions to expect

When would CROSS APPLY outperform a correlated subquery or a ROW_NUMBER approach?
How does OUTER APPLY differ from LEFT JOIN when the right side is a table-valued function?
Can you rewrite a simple CROSS APPLY as an INNER JOIN, and when is that not possible?
What is the performance cost of calling a table-valued function once per row?

13 How do you read a query execution plan?

An execution plan is the set of physical steps the query optimizer chose to run your statement. In SQL Server Management Studio you read it right to left and top to bottom: the operators on the far right touch the raw tables, and data flows leftward into joins, sorts and aggregates until it reaches the final SELECT. Always capture the actual plan (Include Actual Execution Plan, or SET STATISTICS PROFILE ON) rather than the estimated plan, because only the actual plan carries the real row counts.

The operators are the vocabulary you need to recognize:

Index Seek navigates the B-tree straight to the rows it needs. This is what you want on a large table.
Index Scan or Table Scan reads every row (or the whole index). Fine on a small lookup table, a red flag on a big one.
Key Lookup (or RID Lookup) means a nonclustered index found the row but had to jump back to the clustered index for extra columns. Many lookups usually mean the index needs more INCLUDE columns.
Nested Loops join is cheap when one input is tiny; Hash Match join suits large unsorted inputs; Merge join is best when both inputs are already sorted on the join key.
Sort and Spool operators are often the expensive ones. A sort you did not ask for frequently points at a missing index that could deliver rows in order.

Two visual cues matter most. First, the thickness of the arrows between operators is proportional to the number of rows flowing along them, so a fat arrow feeding a sort or a loop join is where your cost lives. Second, hover any operator and compare Estimated Number of Rows with Actual Number of Rows: a large gap means the optimizer guessed wrong, usually because of stale or missing statistics, and that bad estimate cascades into poor operator and join choices. See statistics and cardinality for why that estimate drives everything.

A practical reading checklist: look for scans on big tables, repeated key lookups, expensive sorts or spools, and any estimate versus actual mismatch. Back the plan up with runtime numbers using SET STATISTICS IO ON (logical reads per table) and SET STATISTICS TIME ON (CPU and elapsed time). Fewer logical reads after a change is the clearest proof it helped. For the full workflow of turning these findings into fixes, see performance tuning.

Measure real work with STATISTICS IO and TIME

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT o.order_id, o.total
FROM Orders AS o
WHERE o.customer_id = 4821;

-- Read the Messages tab: 'logical reads' is pages touched.
-- A scan reports reads near the table's page count; a seek reports only a handful.

Logical reads and CPU time turn an abstract plan into concrete numbers you can compare before and after a change.

Turn a scan into a seek by adding an index

-- Before: no useful index, so the plan shows a Table/Clustered Index Scan.
SELECT order_id, total
FROM Orders
WHERE customer_id = 4821;

-- Add an index that covers the predicate and the output columns.
CREATE NONCLUSTERED INDEX IX_Orders_customer
    ON Orders (customer_id)
    INCLUDE (total);

-- After: the same query now shows an Index Seek with no Key Lookup,
-- and STATISTICS IO reports far fewer logical reads.

INCLUDE (total) makes the index covering, so the seek returns everything the SELECT needs without a Key Lookup back to the clustered index.

Follow-up questions to expect

What is the difference between an estimated plan and an actual plan?
When would a table scan be a better choice than an index seek?
What causes a large gap between estimated and actual row counts, and how do you fix it?
How do nested loops, hash and merge joins differ, and how does the optimizer pick one?

14 Why does the column order in a composite index matter?

A composite index (also called a multi-column or compound index) is built on two or more columns in a fixed order. The B-tree is sorted by the first key column, then by the second column within each value of the first, then by the third within that, and so on. It behaves exactly like a phone book sorted by last name and then first name: the ordering is only useful if you know the leading part of the key.

Because of that layout, the optimizer can perform an efficient index seek only on a leftmost prefix of the key columns. An index on (a, b) can seek on a, or on a and b together, but it cannot seek on b alone. A predicate on b by itself has to fall back to a full scan of the index or table, because rows matching a given b value are scattered throughout the structure.

Most selective / equality columns first. Columns filtered with equality (=) belong at the front so the seek narrows the range as quickly as possible.
Range columns last. Once you use a range predicate (>, <, BETWEEN, LIKE 'x%') on a column, the columns after it in the key can no longer be used for seeking, so put range filters at the end.
Order matches ORDER BY. If the index key order lines up with an ORDER BY (and the leading columns are fixed by equality), the engine can return rows already sorted and skip a separate sort operator.

This is why column order, not just column choice, drives whether an index helps. A well-ordered composite index can satisfy filtering and sorting in a single seek; a poorly ordered one forces scans and sorts. For more depth on how indexes are structured and chosen, see the guide on SQL indexes and the performance tuning notes.

A composite index on (a, b)

CREATE NONCLUSTERED INDEX IX_Orders_Cust_Date
    ON Orders (customer_id, order_date);

-- Seekable: equality on the leading column, range on the last
SELECT order_id, total
FROM Orders
WHERE customer_id = 42
  AND order_date >= '2026-01-01';

customer_id (equality) leads and order_date (range) trails, so the whole predicate is one efficient index seek.

Leading column vs trailing column alone

-- Uses the index: filters on the leftmost prefix (a)
SELECT order_id FROM Orders
WHERE customer_id = 42;

-- Cannot seek: order_date is the second key column, not a prefix
SELECT order_id FROM Orders
WHERE order_date = '2026-01-01';

The first query seeks on the leading column; the second must scan because order_date alone is not a leftmost prefix of (customer_id, order_date).

Follow-up questions to expect

How does a covering index differ from a plain composite index?
What is the difference between an index seek and an index scan?
How do included columns (INCLUDE) change what a composite index can serve?
When would you create two separate single-column indexes instead of one composite index?

15 What is the difference between optimistic and pessimistic concurrency?

The two strategies answer the same question - how do I stop two sessions from clobbering each other - with opposite assumptions about how often conflicts happen.

Pessimistic concurrency assumes conflicts are likely, so it takes a lock the moment a row is read and holds it until the transaction commits. Any other session that wants the same row waits. This guarantees that nobody reads stale data or overwrites your change, but it costs concurrency: readers and writers block each other, transactions must be kept short, and holding locks across multiple rows raises the risk of deadlocks.

Optimistic concurrency assumes conflicts are rare, so it takes no locks on read. Instead it remembers a version marker for the row - typically a rowversion (also called timestamp) column or a LastModified value - and, when it finally writes, it adds that marker to the WHERE clause. If another session changed the row in the meantime the version no longer matches, the UPDATE affects zero rows, and the application detects the lost update and retries with fresh data. This gives much higher concurrency because reads never block, at the price of having to handle the retry.

Pessimistic: lock on read, others wait. Safe, lower concurrency, deadlock risk. Good for hot rows with frequent contention.
Optimistic: no read lock, validate a version at write time, retry on conflict. High concurrency, but you must code the retry. Good for read-heavy workloads where clashes are uncommon.

Row versioning isolation levels support the optimistic pattern. Turning on snapshot isolation (or READ_COMMITTED_SNAPSHOT) lets readers see a consistent point-in-time version of a row from the version store instead of taking shared locks, so reads and writes stop blocking each other. Explicit lock hints like UPDLOCK drive the pessimistic pattern by reserving a row the instant you read it.

Optimistic update: guard on the rowversion, retry if zero rows change

-- Product has a rowversion column named RowVer.
-- 1) Read the row and remember its version.
DECLARE @ver rowversion;
SELECT @ver = RowVer
FROM Products
WHERE ProductId = 42;

-- 2) ...application edits the price, no locks held...

-- 3) Write only if nobody changed the row since we read it.
UPDATE Products
SET Price = 19.99
WHERE ProductId = 42
  AND RowVer = @ver;

IF @@ROWCOUNT = 0
    THROW 50001, 'Row changed by another user - reload and retry.', 1;

The RowVer comparison detects a lost update: if another session wrote first, no rows match and the app retries with fresh data.

Pessimistic read: reserve the row immediately with UPDLOCK

BEGIN TRANSACTION;

-- UPDLOCK takes an update lock on read, so any other session
-- trying to read-for-update this row must wait until we commit.
SELECT Balance
FROM Accounts WITH (UPDLOCK, HOLDLOCK)
WHERE AccountId = 7;

UPDATE Accounts
SET Balance = Balance - 100
WHERE AccountId = 7;

COMMIT TRANSACTION;

Holding the lock from read through write prevents a competing transaction from slipping in, at the cost of making it block.

Follow-up questions to expect

How does a rowversion (timestamp) column change value, and can you rely on it for ordering?
What is the lost update problem and how does each strategy prevent it?
How do snapshot and read committed snapshot isolation enable optimistic reads?
When would you choose pessimistic locking over optimistic in a high traffic system?

Keep going

Explore the other interview categories, or drill the fundamentals with the functions library and the SQL Server error library.

Get 1-on-1 Prep All Categories