Home › SQL Interview Questions › Applied

SQL Interview Prep

Scenario-Based SQL Questions

Q: Write a query to find the second highest salary.

Use DENSE_RANK to rank distinct salaries and pick rank 2, which correctly handles ties. Alternatives are ORDER BY salary DESC with OFFSET 1 ROWS FETCH NEXT 1, or a subquery finding the MAX salary that is less than the overall MAX.

Q: Write a query to find the Nth highest salary.

Rank the distinct salaries with DENSE_RANK() ordered descending and keep the row where the rank equals N. This generalizes the classic second-highest-salary problem: N = 2 gives the second highest, N = 3 the third, and so on.

Q: How do you delete duplicate rows but keep one copy?

Number the duplicates with ROW_NUMBER() OVER (PARTITION BY the duplicate columns ORDER BY a tiebreak) in a CTE, then DELETE the rows where the row number is greater than 1. Run it inside a transaction so you can check the count before you commit.

Q: Write a query to produce a running total of sales.

Use a windowed SUM with an ORDER BY frame: SUM(amount) OVER (ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW). Add PARTITION BY for a per group total, and always specify ROWS so ties do not collapse into one value the way the RANGE default would.

Q: How do you get the top N rows per group (for example top 3 per category)?

Number the rows inside each group with ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales DESC), then in an outer query or CTE keep only the rows where that number is less than or equal to N. CROSS APPLY with SELECT TOP N is a common alternative.

Q: Find employees who earn more than their manager.

Self join the Employees table to itself, matching each employee row to its manager row on e.manager_id = m.emp_id, then keep only rows where e.salary > m.salary. An INNER JOIN naturally drops top level employees who have no manager.

Q: How do you solve a gaps and islands problem?

Number the rows in order with ROW_NUMBER(), then subtract that number from the sequence value (an id or date). Consecutive rows share the same difference, so you can GROUP BY it to collapse each run into one island and read the gaps between islands.

Q: How do you turn rows into columns (pivot monthly sales)?

Aggregate the value while spreading a category across columns: either the T-SQL PIVOT operator, or the portable SUM(CASE WHEN month = ... THEN amount END) pattern grouped by the row key. For unknown column values you must build the column list at runtime with dynamic SQL.

Q: Write a query for month over month growth.

Aggregate the rows into one total per month, then use LAG(total) OVER (ORDER BY month) to pull the previous month onto the same row. Subtract to get the change, and divide by the previous total (wrapped in NULLIF to avoid divide by zero) for the percent growth.

Q: Find all customers who have never placed an order.

Anti-join the two tables: LEFT JOIN Orders on the customer id and keep rows WHERE o.order_id IS NULL, or use WHERE NOT EXISTS with a correlated subquery. NOT EXISTS is preferred because it is NULL-safe, while NOT IN can return no rows if the subquery yields any NULL.

Scenario questions hand you a situation and ask you to write the query. These are the most practical interview tests. Each answer walks through the approach with a diagram of the data shape and a working SQL solution you can adapt.

Applied 15 questions

1 Write a query to find the second highest salary.

This is the classic SQL interview question, and the interviewer is really testing whether you understand ties and what "second highest" means. Almost always it means the second highest distinct salary value, not the value in the second row. If three people all earn the top salary, the second highest salary is still the next lower amount, not another copy of the top amount.

The cleanest and most robust answer uses DENSE_RANK. Ranking with DENSE_RANK assigns the same rank to equal salaries and does not skip numbers, so rank 1 is the highest distinct salary and rank 2 is the second highest distinct salary regardless of how many people share the top pay.

DENSE_RANK gives the true second distinct salary and generalises to the Nth highest by changing the filter to = N.
OFFSET FETCH is compact but works on rows, so you must apply DISTINCT to the salaries first, otherwise duplicated top salaries push the answer down.
Subquery MAX(< MAX) finds the largest salary strictly below the overall maximum. It is portable and needs no window functions.

A subtle point interviewers probe: DENSE_RANK vs RANK vs ROW_NUMBER. ROW_NUMBER would return an arbitrary single person from the tied top group and never reach the second distinct value; RANK skips numbers after ties so rank 2 may not exist. Only DENSE_RANK maps cleanly onto "second distinct salary".

Always confirm the edge case: if every employee earns the same salary there is no second highest, and a well written query should return NULL or an empty result rather than an error. The subquery form naturally returns NULL, which is why it is a safe default answer.

DENSE_RANK: the true second distinct salary (handles ties)

WITH Ranked AS (
    SELECT
        salary,
        DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
    FROM Employees
)
SELECT DISTINCT salary
FROM Ranked
WHERE rnk = 2;

DENSE_RANK gives every tied top salary rank 1, so rank 2 is always the next lower distinct amount. Change 2 to N for the Nth highest.

OFFSET FETCH: skip the top distinct salary, take the next

SELECT DISTINCT salary
FROM Employees
ORDER BY salary DESC
OFFSET 1 ROWS FETCH NEXT 1 ROWS ONLY;

DISTINCT collapses tied salaries first, then OFFSET 1 skips the highest and FETCH NEXT 1 returns the second highest.

Subquery: largest salary below the overall maximum

SELECT MAX(salary) AS second_highest
FROM Employees
WHERE salary < (SELECT MAX(salary) FROM Employees);

Portable and window-free. Returns NULL when all salaries are equal, which is the correct answer for that edge case.

Follow-up questions to expect

How would you generalise this to find the Nth highest salary?
Why does DENSE_RANK handle ties correctly here but ROW_NUMBER does not?
How do you return the second highest salary per department instead of overall?
What should the query return when every employee earns the same salary?

2 Write a query to find the Nth highest salary.

This is the general form of the classic second highest salary question. Instead of hard coding the value 2, you rank every salary from highest to lowest and return the row whose rank equals N. The subtle part interviewers listen for is how you handle ties: if two employees both earn the top salary, should they count as one rank or two? For "Nth highest distinct salary" the answer is one, which points you at DENSE_RANK().

The cleanest approach uses the DENSE_RANK() window function. It assigns 1 to the highest salary, 2 to the next distinct salary, and never skips a number even when values are tied. Wrap it in a subquery (a window function cannot appear directly in a WHERE clause) and filter for rnk = N:

DENSE_RANK() = N treats each distinct salary as one step, so ties share a rank. This is what most interviewers mean by "Nth highest salary".
ORDER BY ... OFFSET (N-1) ROWS FETCH NEXT 1 skips N-1 physical rows and returns the next one. It counts rows, not distinct values, so duplicate salaries can throw the position off unless you pre-aggregate with SELECT DISTINCT.

Use DENSE_RANK() when you care about distinct salary levels (three people tied for first still leave the second distinct salary as the "second highest"). Use OFFSET / FETCH when you literally want the Nth row in a sorted list. For a per-department answer, add PARTITION BY department_id so the ranking restarts inside each group and you get the Nth highest salary in every department at once.

Edge cases to mention: if the table has fewer than N distinct salaries, the query correctly returns no rows (an empty result), not zero and not NULL. Some interviewers want a single scalar even when nothing matches, in which case you wrap the query so it yields NULL. Always confirm whether N counts distinct salaries or raw rows before you write a line.

Nth highest distinct salary with DENSE_RANK()

DECLARE @N INT = 3;

SELECT MAX(salary) AS nth_highest_salary
FROM (
    SELECT salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
    FROM Employees
) AS ranked
WHERE rnk = @N;

DENSE_RANK() ranks distinct salaries so ties share a rank. If fewer than N distinct salaries exist, MAX over an empty set returns NULL; drop the outer MAX to return no rows instead.

Per-department Nth highest with PARTITION BY

DECLARE @N INT = 2;

SELECT department_id, salary
FROM (
    SELECT department_id, salary,
           DENSE_RANK() OVER (
               PARTITION BY department_id
               ORDER BY salary DESC
           ) AS rnk
    FROM Employees
) AS ranked
WHERE rnk = @N
ORDER BY department_id;

PARTITION BY restarts the ranking inside each department, so one query returns the Nth highest salary for every department at once.

Follow-up questions to expect

How does DENSE_RANK() differ from RANK() and ROW_NUMBER() here?
How would you return the Nth highest salary per department in a single query?
What does your query return if there are fewer than N distinct salaries?
How would you solve this using OFFSET ... FETCH instead of a window function?

3 How do you delete duplicate rows but keep one copy?

The safe, standard pattern is to rank the copies within each group and delete everything except the first copy. You do this with the ROW_NUMBER() window function inside a CTE: ROW_NUMBER() OVER (PARTITION BY <the columns that define a duplicate> ORDER BY <a tiebreak>). Every row in a group of duplicates gets a number 1, 2, 3, and so on, so a plain DELETE WHERE rn > 1 removes the extras and keeps exactly one.

The PARTITION BY list is what "duplicate" means for your data. If two rows count as duplicates when their name and email match, partition by name, email. The ORDER BY inside the window decides which copy survives: number 1 is kept, so order so that the row you want to keep sorts first. For example ORDER BY created_at ASC keeps the oldest row and ORDER BY id DESC keeps the newest.

Always do this inside a transaction. Run the count or a SELECT version first, confirm you are about to delete the number of rows you expect, and only then COMMIT. If the count looks wrong, ROLLBACK and nothing is lost. This is a destructive operation on live data, so the extra step is worth it.

Two alternatives come up in interviews. A self join can delete duplicates by keeping the row with the minimum id per group and deleting rows that have a smaller-keyed twin, but it is harder to read and only works cleanly when the table has a unique key. A temp table approach selects the distinct or ranked rows into a new table, truncates the original, and reloads it, which is useful when there is no key at all or when almost every row is a duplicate.

ROW_NUMBER() is the go-to because it works even on a heap with no primary key, lets you pick the surviving copy precisely, and reads clearly. See the find duplicate rows question for the read-only version of the same technique.

Delete duplicates, keep the oldest copy, with ROW_NUMBER

WITH cte AS (
    SELECT
        ROW_NUMBER() OVER (
            PARTITION BY name, email
            ORDER BY created_at ASC
        ) AS rn
    FROM Customers
)
DELETE FROM cte
WHERE rn > 1;

PARTITION BY name, email defines a duplicate; ORDER BY created_at ASC keeps the earliest row (rn = 1) and deletes the rest.

Do it safely: check inside a transaction, then commit

BEGIN TRANSACTION;

WITH cte AS (
    SELECT
        ROW_NUMBER() OVER (
            PARTITION BY name, email
            ORDER BY id DESC
        ) AS rn
    FROM Customers
)
DELETE FROM cte
WHERE rn > 1;

-- Inspect @@ROWCOUNT; if it is what you expect, COMMIT, else ROLLBACK
SELECT @@ROWCOUNT AS rows_deleted;

COMMIT TRANSACTION;

ORDER BY id DESC keeps the newest copy. Review rows_deleted before COMMIT; ROLLBACK undoes everything if the count looks wrong.

Follow-up questions to expect

How would you decide which duplicate copy to keep?
What changes if the table has no primary key or unique column?
How is this different from using SELECT DISTINCT?
Would TRUNCATE or DROP ever be a better fit than DELETE here?

4 Write a query to produce a running total of sales.

A running total (a cumulative sum) adds each row to the sum of every row before it. The clean way to do this in SQL Server is a windowed SUM() with an OVER clause that carries an ORDER BY. The ordering defines what "before" means, and the frame defines how far back the sum reaches.

The full form is SUM(amount) OVER (ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW). UNBOUNDED PRECEDING starts the frame at the first row of the partition and CURRENT ROW ends it at the row being evaluated, so each row sees itself plus everything earlier in the order.

To get a running total per customer instead of one grand total, add PARTITION BY customer_id. The window resets at every new customer, so each customer accumulates independently while still ordering by date inside the group.

The detail interviewers probe is the frame. If you omit ROWS, SQL Server applies the default frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. RANGE groups peer rows that tie on the ORDER BY value, so two sales on the same date both receive the total through that date rather than a distinct step. Writing ROWS explicitly gives a true row by row total and also lets the optimizer use a faster in memory spool. See ROWS vs RANGE for the full contrast.

A close cousin is a moving (trailing) sum: bound the frame to the last few rows with ROWS 2 PRECEDING, which sums the current row and the two before it. That is the basis of moving averages and rolling windows. This whole family builds on the window function mechanics.

Running total of sales, overall and per customer

SELECT
    sale_date,
    customer_id,
    amount,
    SUM(amount) OVER (
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total,
    SUM(amount) OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total_per_customer
FROM Sales
ORDER BY sale_date;

The first SUM accumulates across all rows; the second resets for each customer because of PARTITION BY. ROWS makes both true row by row totals.

Moving (trailing) sum of the last three sales

SELECT
    sale_date,
    amount,
    SUM(amount) OVER (
        ORDER BY sale_date
        ROWS 2 PRECEDING
    ) AS trailing_3_total
FROM Sales
ORDER BY sale_date;

ROWS 2 PRECEDING is shorthand for BETWEEN 2 PRECEDING AND CURRENT ROW, summing the current row plus the two before it for a rolling window.

Follow-up questions to expect

Why should you specify ROWS instead of relying on the default RANGE frame?
How would you compute a running total that resets at the start of each month?
How does a moving average differ from a cumulative running total in the OVER clause?
What happens to the running total when two rows tie on the ORDER BY column?

5 How do you get the top N rows per group (for example top 3 per category)?

A plain TOP N or WHERE filter gives you the top N rows across the whole table, not per group. To get the top N within each group you need to restart the ranking for every group, and that is exactly what a window function with PARTITION BY does.

The standard pattern uses ROW_NUMBER(). You assign a sequential number to each row inside its partition, ordered by the metric you care about, and then filter on that number. Because the window function cannot appear directly in a WHERE clause (it is evaluated after WHERE), you wrap it in a CTE or derived table and filter in the outer query.

PARTITION BY category resets the counter to 1 at the start of each category.
ORDER BY sales DESC decides which rows count as the top ones.
WHERE rn <= 3 in the outer query keeps only the top 3 per category.

Your choice of ranking function decides how ties are handled. ROW_NUMBER() always gives exactly N rows per group but breaks ties arbitrarily, so two rows with equal sales get different numbers. If you want to include every row that ties for a top position, use RANK() or DENSE_RANK() instead, which is covered in RANK vs DENSE_RANK vs ROW_NUMBER.

An alternative that reads well is CROSS APPLY: for each row of the outer categories table, run a correlated SELECT TOP N ... ORDER BY against the detail table. This can be faster when a supporting index exists and the number of groups is small, because the optimizer can seek the top N rows per group rather than sort the whole table. See CROSS APPLY vs JOIN for the trade-offs.

ROW_NUMBER in a CTE: top 3 products by sales per category

WITH Ranked AS (
    SELECT
        category,
        product,
        sales,
        ROW_NUMBER() OVER (
            PARTITION BY category
            ORDER BY sales DESC
        ) AS rn
    FROM Products
)
SELECT category, product, sales
FROM Ranked
WHERE rn <= 3
ORDER BY category, sales DESC;

PARTITION BY restarts the numbering per category; the outer WHERE keeps only rn 1 to 3. Swap ROW_NUMBER for RANK or DENSE_RANK to include ties.

CROSS APPLY with TOP N: same result, one lookup per category

SELECT c.category, t.product, t.sales
FROM Categories AS c
CROSS APPLY (
    SELECT TOP (3) p.product, p.sales
    FROM Products AS p
    WHERE p.category = c.category
    ORDER BY p.sales DESC
) AS t
ORDER BY c.category, t.sales DESC;

For each category, the inner TOP (3) returns its three best selling products. With an index on (category, sales) the optimizer can seek instead of sorting everything.

Follow-up questions to expect

Why can you not put the ROW_NUMBER() result directly in a WHERE clause?
How would you change the query to include rows that tie for third place?
When would CROSS APPLY outperform the ROW_NUMBER approach, and what index helps?
How do you break ties deterministically so the same rows are returned every run?

6 Find employees who earn more than their manager.

This is a classic self join problem. All the data lives in a single Employees table where each row carries a manager_id that points back at the emp_id of another row in the same table. To compare an employee against their manager you need two copies of that one table at the same time, so you join the table to an aliased copy of itself.

The trick is the aliasing. You give the table two different names, for example e for the employee and m for the manager. SQL then treats e and m as if they were two separate tables even though they read from the same physical Employees table. The join condition e.manager_id = m.emp_id lines up each employee row with the single row that describes their manager.

Once the two rows sit side by side, the comparison is just a filter: WHERE e.salary > m.salary keeps only the employees who out earn the person they report to. Because this predicate references columns from both aliases, it can only be evaluated after the self join has paired the rows.

Aliases are mandatory here. Without e and m, the parser cannot tell which copy of a column such as salary or emp_id you mean, and the query is ambiguous.
Employees with no manager are excluded. A top level employee (for example the CEO) has manager_id = NULL. An INNER JOIN requires a matching m.emp_id, and nothing equals NULL, so those rows drop out automatically. That is usually the behaviour you want, since a person with no manager cannot earn more than one.
If you did want to keep managerless employees in the output, you would switch to a LEFT JOIN, but then the e.salary > m.salary filter would still remove them because comparing to NULL yields UNKNOWN.

This pattern generalises to any question that compares a row to a related row in the same table: reports to a manager, a product to its parent category, or any hierarchical model stored with a self referencing key.

Self join: employees who earn more than their manager

SELECT e.name AS employee
FROM Employees AS e
INNER JOIN Employees AS m
    ON e.manager_id = m.emp_id
WHERE e.salary > m.salary;

The same table is aliased as e (employee) and m (manager); the INNER JOIN drops anyone with a NULL manager_id.

Variant: list both salaries for context

SELECT e.name AS employee,
       e.salary AS employee_salary,
       m.name AS manager,
       m.salary AS manager_salary
FROM Employees AS e
INNER JOIN Employees AS m
    ON e.manager_id = m.emp_id
WHERE e.salary > m.salary
ORDER BY e.salary - m.salary DESC;

Selecting columns from both aliases shows the manager next to the employee and lets you sort by the size of the gap.

Follow-up questions to expect

How would you also include the employees who have no manager in the result?
What changes if you switch the INNER JOIN to a LEFT JOIN?
How would you find employees who earn more than the average salary of their department instead?
Why are table aliases required in a self join?

7 How do you solve a gaps and islands problem?

A gaps and islands problem asks you to find unbroken runs of consecutive values (the islands) and the ranges that are missing between them (the gaps). Typical cases are consecutive integer ids, consecutive dates such as login streaks, or contiguous ranges of numbers. The goal is usually to collapse each run into a single row showing its start and end.

The classic trick uses ROW_NUMBER(). You number the rows in order, then subtract that row number from the sequence value. Within one unbroken run both the value and the row number increase by one at every step, so the difference stays constant. The moment there is a gap the value jumps but the row number does not, so the difference changes and a new group begins.

Islands: compute value - ROW_NUMBER() as a group key, then GROUP BY it and take MIN and MAX of the value to get each run start and end.
Dates: the same idea works if you subtract a ROW_NUMBER() counted in days, for example DATEADD(DAY, -ROW_NUMBER() OVER (ORDER BY login_date), login_date). Every date in one streak maps to the same anchor date.
Gaps: once you have the islands, a gap is the range between the MAX of one island and the MIN of the next, which you can get with LEAD over the ordered island ends.

Partition the window when the runs are per entity. For a per-user login streak you write ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) so each user is numbered independently and streaks never bleed across users. This is one of the most common uses of a window function in interviews.

Interviewers like this problem because it rewards understanding rather than memorization: once you can explain why value - ROW_NUMBER() is constant inside a run, the rest is just GROUP BY. The Microsoft ROW_NUMBER documentation covers the ordering and partitioning details.

Islands over consecutive integer ids

-- Collapse runs of consecutive ids into start/end pairs
WITH Numbered AS (
    SELECT
        id,
        id - ROW_NUMBER() OVER (ORDER BY id) AS grp
    FROM Readings
)
SELECT
    MIN(id) AS island_start,
    MAX(id) AS island_end,
    COUNT(*) AS run_length
FROM Numbered
GROUP BY grp
ORDER BY island_start;

Consecutive ids share the same grp because id and the row number rise together; a gap changes grp and starts a new island.

Islands over consecutive login dates per user

-- Find each user's login streaks (start, end, days)
WITH Numbered AS (
    SELECT
        user_id,
        login_date,
        DATEADD(DAY,
            -ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date),
            login_date) AS grp
    FROM Logins
)
SELECT
    user_id,
    MIN(login_date) AS streak_start,
    MAX(login_date) AS streak_end,
    COUNT(*)        AS days_in_streak
FROM Numbered
GROUP BY user_id, grp
ORDER BY user_id, streak_start;

Subtracting a per-user row number in days maps every date in one streak to the same anchor date, so GROUP BY collapses the streak.

Follow-up questions to expect

Why does subtracting ROW_NUMBER() from the value produce a constant group key inside a run?
How would you report the gaps (missing ranges) instead of the islands?
How does this change if the dates can contain duplicates for the same user?
Could you solve this with LAG instead of the ROW_NUMBER difference trick?

8 How do you turn rows into columns (pivot monthly sales)?

Pivoting means taking values that sit in rows and spreading them across columns, collapsing many rows into one wider summary row. A classic case is a sales table with one row per region per month that you want to reshape into one row per region with a column for each month. Every pivot has three parts: a row key to group on (for example region), a column source whose distinct values become the new columns (the month), and an aggregate that fills each cell (the summed amount).

SQL Server gives you the PIVOT operator as dedicated syntax. You list the aggregate over the value column, then FOR month IN ([Jan], [Feb], ...) to name the columns. It reads well but the column names must be hard coded in the query text.

The portable alternative is conditional aggregation: wrap a CASE expression inside an aggregate, one per target column, and GROUP BY the row key. SUM(CASE WHEN month = 'Jan' THEN amount END) adds up only the January rows and returns NULL for the rest, so each region collapses to a single row with a total per month. This works in essentially every database (MySQL, older engines, PostgreSQL) and is easy to extend with extra computed columns.

PIVOT operator is concise and self documenting, but SQL Server specific and needs a fixed column list.
SUM(CASE ...) is portable, flexible, and lets you mix aggregates, but is more verbose.
Both require you to know the column values up front. To pivot an unknown or growing set of months you must query the distinct values, build the column list into a string, and run it with dynamic SQL via sp_executesql.

Use ISNULL(..., 0) around each column if you want zeros instead of NULLs for empty cells. Reversing the operation, turning columns back into rows, is done with UNPIVOT or a CROSS APPLY of VALUES. A close cousin of this pattern is expressing each cell as a percent of total.

PIVOT operator: one row per region, a column per month

SELECT region, [Jan], [Feb], [Mar]
FROM (
    SELECT region, month, amount
    FROM Sales
) AS src
PIVOT (
    SUM(amount)
    FOR month IN ([Jan], [Feb], [Mar])
) AS pv
ORDER BY region;

The aggregate SUM(amount) fills each cell; FOR month IN (...) lists the distinct values that become columns.

Portable conditional aggregation with SUM(CASE)

SELECT region,
       SUM(CASE WHEN month = 'Jan' THEN amount END) AS Jan,
       SUM(CASE WHEN month = 'Feb' THEN amount END) AS Feb,
       SUM(CASE WHEN month = 'Mar' THEN amount END) AS Mar
FROM Sales
GROUP BY region
ORDER BY region;

Each CASE keeps only the rows for one month; GROUP BY region collapses them into a single wide row. Works in any SQL dialect.

Follow-up questions to expect

How would you pivot when the list of months is not known until run time?
What is the difference between PIVOT and SUM(CASE WHEN ...) in practice?
How do you turn the pivoted columns back into rows?
How would you show each month as a percentage of the region total?

9 Write a query for month over month growth.

Month over month (MoM) growth compares each month total to the month right before it. The pattern is two steps: first collapse the raw rows into one total per calendar month, then look back one row to fetch the previous month total so both values sit on the same row and can be compared.

The clean way to look back one row is the LAG window function. LAG(total) OVER (ORDER BY month) returns the total from the preceding row in month order, which is exactly the previous month. Because it is a window function it runs after the grouping, so you either aggregate in a CTE first and apply LAG to the CTE, or apply LAG over the aggregate in a second query level.

Absolute change is curr - prev.
Percent growth is (curr - prev) * 100.0 / prev. Multiply by 100.0 (not 100) so integer division does not truncate the result to zero.
The first month has no prior row, so LAG returns NULL and the growth is correctly NULL rather than a bogus number.

The one trap interviewers watch for is divide by zero. If a month total is zero, dividing by it raises error 8134 (Divide by zero encountered). Guard it with NULLIF: NULLIF(prev, 0) turns a zero denominator into NULL, so the whole expression yields NULL instead of crashing the query.

To report growth per product (or per region, customer, and so on), add PARTITION BY product_id to the window. LAG then restarts inside each product, so the previous value never leaks across product boundaries, and the first month of every product is NULL as expected. See LAG and LEAD for more on the offset functions.

Overall month over month growth with a divide-by-zero guard

WITH monthly AS (
    SELECT
        DATEFROMPARTS(YEAR(order_date), MONTH(order_date), 1) AS month_start,
        SUM(total) AS revenue
    FROM Orders
    GROUP BY DATEFROMPARTS(YEAR(order_date), MONTH(order_date), 1)
)
SELECT
    month_start,
    revenue,
    LAG(revenue) OVER (ORDER BY month_start) AS prev_revenue,
    revenue - LAG(revenue) OVER (ORDER BY month_start) AS change_amt,
    (revenue - LAG(revenue) OVER (ORDER BY month_start)) * 100.0
        / NULLIF(LAG(revenue) OVER (ORDER BY month_start), 0) AS pct_growth
FROM monthly
ORDER BY month_start;

The CTE builds one revenue row per month; LAG reads the prior month, and NULLIF turns a zero previous total into NULL so the division never raises error 8134.

Per product growth using PARTITION BY

WITH monthly AS (
    SELECT
        product_id,
        DATEFROMPARTS(YEAR(order_date), MONTH(order_date), 1) AS month_start,
        SUM(total) AS revenue
    FROM Orders
    GROUP BY product_id,
             DATEFROMPARTS(YEAR(order_date), MONTH(order_date), 1)
)
SELECT
    product_id,
    month_start,
    revenue,
    (revenue - LAG(revenue) OVER (PARTITION BY product_id ORDER BY month_start)) * 100.0
        / NULLIF(LAG(revenue) OVER (PARTITION BY product_id ORDER BY month_start), 0) AS pct_growth
FROM monthly
ORDER BY product_id, month_start;

PARTITION BY product_id restarts the window per product, so the previous month never carries over from one product to the next and each product first month is NULL.

Follow-up questions to expect

How would you compute year over year growth instead of month over month?
What happens for the first month, and how do you present a NULL growth to the business?
Why do you multiply by 100.0 rather than 100 in the percent calculation?
How would you fill in months that have no orders at all so the series has no gaps?

10 Find all customers who have never placed an order.

This is a classic anti-join: you want the rows in Customers that have no match in Orders. There is no single ANTI JOIN keyword in T-SQL, so you express it in one of two standard ways.

The first pattern is a LEFT JOIN ... IS NULL. A LEFT JOIN keeps every customer and fills the order columns with NULL when there is no matching order. Filtering WHERE o.order_id IS NULL then throws away the customers that did match, leaving only the ones who never ordered. Test the IS NULL against a column that can never be NULL for a real match, such as the primary key order_id.

The second pattern is NOT EXISTS with a correlated subquery. It reads as a direct statement of intent - return the customer when no order row exists for that customer - and it is usually the preferred choice. It is NULL-safe, it stops as soon as one matching order is found (a semi-join the optimizer handles well), and it does not need a DISTINCT to avoid duplicate rows the way a join sometimes does.

LEFT JOIN + IS NULL - intuitive and easy to visualize, but be careful to test a non-nullable column and you may need DISTINCT if a customer could match many rows.
NOT EXISTS - correlated, NULL-safe, and typically the cleanest and best performing option.
NOT IN - looks similar but is a trap when the subquery can produce NULLs.

Why avoid NOT IN (SELECT customer_id FROM Orders)? If Orders.customer_id contains even one NULL, the NOT IN predicate evaluates to UNKNOWN for every customer (because the engine cannot prove the value is not equal to an unknown), so the query returns zero rows. NOT EXISTS does not have this problem, which is why interviewers like to steer you toward it. See the related anti-join with NOT EXISTS question for more.

LEFT JOIN with IS NULL: keep customers with no matching order

SELECT c.customer_id, c.name
FROM Customers AS c
LEFT JOIN Orders AS o
    ON o.customer_id = c.customer_id
WHERE o.order_id IS NULL;

The LEFT JOIN keeps every customer; IS NULL on the non-nullable order_id removes anyone who did match.

NOT EXISTS: the preferred, NULL-safe anti-join

SELECT c.customer_id, c.name
FROM Customers AS c
WHERE NOT EXISTS (
    SELECT 1
    FROM Orders AS o
    WHERE o.customer_id = c.customer_id
);

Returns a customer only when no order row exists for them; unaffected by NULLs and needs no DISTINCT.

Follow-up questions to expect

Why can NOT IN return no rows when the subquery contains a NULL?
When would the LEFT JOIN version need a DISTINCT?
How would you instead count orders per customer including the zeros?
Which of these anti-join forms would you expect to perform best and why?

11 How do you count rows per group and still show groups that have zero matches?

The classic mistake is to write INNER JOIN plus COUNT(*). An INNER JOIN only keeps rows that match on both sides, so a group with no child rows produces no rows at all and vanishes from the result. You get counts for busy groups but the quiet ones with zero matches silently disappear, which is almost never what the question is really asking.

The fix is to drive the query from the dimension table (the side you want every group of) and LEFT JOIN to the child table. A LEFT JOIN keeps every row from the left table and fills the right side with NULL when there is no match, so every group survives.

The second, subtler gotcha is which COUNT you use. COUNT(*) counts rows, and after a LEFT JOIN an unmatched group still has one row (with NULLs), so COUNT(*) would report 1 for an empty group. Instead use COUNT(child.key): COUNT(column) ignores NULLs, so an unmatched group counts zero non-NULL child keys and correctly returns 0.

INNER JOIN + COUNT(*): empty groups are dropped entirely. Wrong.
LEFT JOIN + COUNT(*): empty groups show 1. Still wrong.
LEFT JOIN + COUNT(child.key): empty groups show 0. Correct.

An alternative that avoids the join entirely is a correlated subquery in the SELECT list, which naturally returns 0 for empty groups. If you also want to substitute a value in other computed columns you can wrap results in COALESCE, though a plain COUNT(column) already yields 0 and never NULL.

LEFT JOIN with COUNT(child.key): every customer, including those with zero orders

SELECT c.customer_id, c.name,
       COUNT(o.order_id) AS order_count
FROM Customers AS c
LEFT JOIN Orders AS o
    ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.name
ORDER BY order_count DESC;

COUNT(o.order_id) ignores the NULL produced for customers with no orders, so those customers correctly show 0 rather than 1.

Correlated subquery: same result without a join or GROUP BY

SELECT c.customer_id, c.name,
       (SELECT COUNT(*)
        FROM Orders AS o
        WHERE o.customer_id = c.customer_id) AS order_count
FROM Customers AS c
ORDER BY order_count DESC;

The inner COUNT(*) runs per customer and returns 0 when no orders match, so no LEFT JOIN or COALESCE is needed.

Follow-up questions to expect

Why does COUNT(*) return 1 instead of 0 for an empty group after a LEFT JOIN?
How would you list only the customers who have zero orders?
Would moving a filter on the Orders table into the WHERE clause change the result?
How does COUNT(column) differ from COUNT(*) and COUNT(DISTINCT column)?

12 How do you calculate the median in SQL?

The median is the middle value of a sorted set: half the values fall below it and half above. It is a more robust measure of the center than the average because it is not dragged around by a few extreme outliers. The catch in SQL is that there is no standard MEDIAN() aggregate, so you compute it either with an ordered-set function or by manually finding the middle row.

On SQL Server the clean modern way is PERCENTILE_CONT(0.5). It is a within-group ordered-set function: you give it a fraction (0.5 for the median), a WITHIN GROUP (ORDER BY ...) clause that sorts the values, and an OVER (...) window. With an even number of rows it interpolates between the two middle values, which matches the textbook definition of the median.

PERCENTILE_CONT(0.5) returns the interpolated median; its sibling PERCENTILE_DISC(0.5) returns an actual value from the data instead of averaging.
Both require the OVER () clause in SQL Server, so the result is repeated on every row. Wrap it in DISTINCT or a subquery when you want a single scalar.
Add OVER (PARTITION BY grp) to get a median per group in one pass, for example one median per department.

The portable approach works on almost any dialect. Use ROW_NUMBER() to rank rows in order, get the total with COUNT(), then keep only the middle position(s). The trick ORDER BY x ascending combined with a second window ordered descending, or the arithmetic (n + 1) / 2 and (n + 2) / 2, isolates the one middle row for odd counts and the two middle rows for even counts. Averaging those rows gives the same answer as PERCENTILE_CONT. This is a classic ranking-function interview pattern.

Interviewers like this question because it forces you to reason about even versus odd row counts and about the difference between a value that exists in the data and an interpolated one. See the Microsoft docs for PERCENTILE_CONT for the exact semantics.

PERCENTILE_CONT: median salary overall and per department

-- Overall median (repeated on every row, so DISTINCT it)
SELECT DISTINCT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary)
        OVER () AS median_salary
FROM Employees;

-- Median per department in a single pass
SELECT DISTINCT
    department_id,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary)
        OVER (PARTITION BY department_id) AS dept_median
FROM Employees;

PERCENTILE_CONT interpolates between the two middle values on an even count, giving the true median.

Portable ROW_NUMBER method: average the middle one or two rows

WITH Ranked AS (
    SELECT
        salary,
        ROW_NUMBER() OVER (ORDER BY salary)      AS rn_asc,
        COUNT(*)     OVER ()                      AS n
    FROM Employees
)
SELECT AVG(salary) AS median_salary
FROM Ranked
WHERE rn_asc IN ((n + 1) / 2, (n + 2) / 2);

For an odd n both expressions hit the same middle row; for an even n they select the two middle rows, and AVG returns their mean.

Follow-up questions to expect

What is the difference between PERCENTILE_CONT and PERCENTILE_DISC?
How does your median query handle an even versus an odd number of rows?
Why does SQL Server require the OVER () clause on PERCENTILE_CONT?
How would you compute a median without any window functions?

13 Find users with a streak of consecutive login days.

Tracking a streak of consecutive login days is a classic gaps-and-islands problem solved per user. An "island" is a run of dates with no gap between them, and each island is one streak. The trick is to build a group key that is identical for every date inside the same run but different across runs.

The key comes from a simple observation. If you order a user's logins by date and number them 1, 2, 3, ... with ROW_NUMBER, then for a run of consecutive days the login date and the row number both increase by exactly one on every step. So the difference between them stays constant. Subtracting the row number as days from the date with DATEADD turns that constant difference into a concrete anchor date that labels the whole island.

PARTITION BY user so each user is numbered independently, starting from 1.
ORDER BY login_date so the numbering follows calendar order.
DATEADD(day, -rn, login_date) yields the same date for every login in a run, and a different date the moment a gap appears.

Once you have that group key, a GROUP BY user_id, grp with COUNT(*) gives the length of each streak, and MIN/MAX give its start and end dates. A HAVING COUNT(*) >= N filter keeps only streaks of at least N days.

One prerequisite: the login dates must be distinct per user before you number them. If the raw table has multiple logins on the same day, collapse them first with SELECT DISTINCT user_id, CAST(login_ts AS date), otherwise duplicate dates break the one-day-per-step assumption. If window functions are new to you, start with what is a window function.

Build the island key and measure each streak

WITH days AS (
    SELECT DISTINCT user_id, CAST(login_ts AS date) AS login_date
    FROM Logins
),
grouped AS (
    SELECT
        user_id,
        login_date,
        DATEADD(day,
            -ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date),
            login_date) AS grp
    FROM days
)
SELECT
    user_id,
    MIN(login_date) AS streak_start,
    MAX(login_date) AS streak_end,
    COUNT(*)        AS streak_length
FROM grouped
GROUP BY user_id, grp
ORDER BY user_id, streak_start;

ROW_NUMBER walks each user in date order; subtracting it as days makes grp constant across a run and changes it after any gap.

Keep only users with a streak of at least 5 days

WITH days AS (
    SELECT DISTINCT user_id, CAST(login_ts AS date) AS login_date
    FROM Logins
),
grouped AS (
    SELECT
        user_id,
        login_date,
        DATEADD(day,
            -ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date),
            login_date) AS grp
    FROM days
)
SELECT user_id, COUNT(*) AS streak_length
FROM grouped
GROUP BY user_id, grp
HAVING COUNT(*) >= 5
ORDER BY streak_length DESC;

HAVING filters after grouping, so it applies to the full length of each island. Change 5 to your threshold N.

Follow-up questions to expect

Why does subtracting ROW_NUMBER from the date produce a constant group key for consecutive days?
How would you handle multiple logins on the same day before numbering the rows?
How would you find the single longest streak per user instead of listing all streaks?
How would the query change if a streak could span across month or year boundaries?

14 How do you count distinct values within each group?

To count the number of distinct values inside each group you combine COUNT(DISTINCT col) with a GROUP BY. The DISTINCT keyword removes duplicates before the count runs, so COUNT(DISTINCT product_id) tells you how many different products a customer bought, not how many order lines they have.

The classic trap comes when you want a running (cumulative) distinct count that grows row by row. In SQL Server, COUNT(DISTINCT ...) is not allowed as a window function: writing COUNT(DISTINCT col) OVER (...) raises "Use of DISTINCT is not allowed with the OVER clause". Only the plain, non-distinct aggregates support OVER.

The standard workaround is DENSE_RANK(). DENSE_RANK() assigns the same rank to equal values and no gaps, so the highest dense rank seen so far equals the number of distinct values seen so far. You compute two dense ranks (ascending and descending on the same column), add them, and subtract one; or, more simply, take MAX(DENSE_RANK() OVER (ORDER BY col)) over the window to get a running distinct count.

Per group distinct count: COUNT(DISTINCT col) plus GROUP BY collapses each group to one row.
Windowed distinct count: not supported directly; emulate it with DENSE_RANK().
Watch NULLs: like other aggregates, COUNT(DISTINCT col) ignores NULL values.

Knowing both the simple GROUP BY form and the DENSE_RANK() escape hatch is exactly what an interviewer is checking, because it shows you understand the difference between a grouped aggregate and a window computation.

Per group distinct count with GROUP BY

SELECT
    customer_id,
    COUNT(DISTINCT product_id) AS distinct_products,
    COUNT(*)                    AS total_lines
FROM OrderLines
GROUP BY customer_id;

One row per customer: distinct_products counts different products, total_lines counts every order line including repeats.

Running distinct count via DENSE_RANK (window workaround)

-- COUNT(DISTINCT product_id) OVER (...) is NOT allowed in SQL Server,
-- so use MAX(DENSE_RANK()) as a running distinct count instead.
SELECT
    order_date,
    product_id,
    MAX(dr) OVER (ORDER BY order_date
                  ROWS UNBOUNDED PRECEDING) AS running_distinct_products
FROM (
    SELECT
        order_date,
        product_id,
        DENSE_RANK() OVER (ORDER BY product_id) AS dr
    FROM Orders
) AS ranked
ORDER BY order_date;

DENSE_RANK numbers each distinct product; the running MAX of that rank equals the number of distinct products seen up to each row.

Follow-up questions to expect

Why does SQL Server reject COUNT(DISTINCT col) OVER (...)?
How does DENSE_RANK differ from RANK and ROW_NUMBER here?
How are NULL values treated by COUNT(DISTINCT col)?
How would you count distinct values across two columns as a combined key?

15 Write a query that shows each row as a percent of the group total.

The trick is a SUM window function. A plain SUM(amount) with GROUP BY would collapse many rows into one total, which you do not want. Adding an OVER (PARTITION BY grp) clause turns it into a windowed aggregate: it computes the total for each group but keeps every detail row, so you can put the row value and its group total side by side.

Once both values sit on the same row, the percentage is just amount * 100.0 / SUM(amount) OVER (PARTITION BY grp). Reading it left to right: PARTITION BY grp defines the group whose total you divide by, SUM(...) OVER (...) is that group total repeated on every row, and each row divides its own amount into it.

Watch the data types. If amount is an integer, amount / SUM(amount) uses integer division and every result truncates to 0. Multiply by 100.0 (a decimal literal) or wrap a side in CAST(amount AS decimal(18,4)) so the whole expression is evaluated in decimal. Put the 100.0 before the division, not after, so the numerator is already decimal.

Guard against a zero or empty group total. If a partition sums to 0 the division throws a divide-by-zero error, so wrap the denominator with NULLIF: ... / NULLIF(SUM(amount) OVER (PARTITION BY grp), 0). When the total is 0, NULLIF returns NULL and the row yields NULL instead of erroring out.

To get a percent of the grand total instead of a group total, use an empty OVER () with no PARTITION BY. That treats the whole result set as one window, which is a clean way to show what share each row contributes overall.

Percent of the grand total (empty OVER)

SELECT
    category,
    amount,
    amount * 100.0 / NULLIF(SUM(amount) OVER (), 0) AS pct_of_total
FROM Sales
ORDER BY category;

OVER () with no PARTITION BY sums across all rows, so each row shows its share of the overall total.

Percent of each group total (PARTITION BY)

SELECT
    region,
    product,
    amount,
    amount * 100.0
        / NULLIF(SUM(amount) OVER (PARTITION BY region), 0) AS pct_of_region
FROM Sales
ORDER BY region, product;

PARTITION BY region resets the total per region, and 100.0 forces decimal division so integer amounts do not truncate to 0.

Follow-up questions to expect

Why does OVER (PARTITION BY ...) keep every row while GROUP BY collapses them?
What happens if amount is an integer and you divide before multiplying by 100.0?
How does NULLIF protect against a divide-by-zero error here?
How would you also show a running percent as rows accumulate within each group?

Keep going

Explore the other interview categories, or drill the fundamentals with the functions library and the SQL Server error library.

Get 1-on-1 Prep All Categories