Home › SQL Interview Questions › Beginner

SQL Interview Prep

Basic SQL Interview Questions

Q: What is a SQL JOIN and what are the main types?

A JOIN combines rows from two or more tables using a related column. The main types are INNER (only matching rows), LEFT and RIGHT OUTER (all of one side plus matches), FULL OUTER (all rows from both), CROSS (every combination) and a self join (a table joined to a copy of itself).

Q: What is the difference between an INNER JOIN and a LEFT JOIN?

An INNER JOIN returns only rows that have a match in both tables. A LEFT JOIN returns every row from the left table plus its matches from the right, filling NULLs in the right columns where no match exists.

Q: What is a primary key?

A primary key is a column or set of columns that uniquely identifies every row in a table. Its values must be unique and NOT NULL, and each table can have only one primary key.

Q: What is a foreign key and what does it enforce?

A foreign key is a column (or set of columns) whose values must match a primary or unique key in another table, or the same table. It enforces referential integrity, so a child row cannot reference a parent that does not exist and a referenced parent cannot be deleted while children point to it.

Q: What is the difference between a primary key and a unique key?

Both enforce uniqueness, but a table can have only one primary key and it cannot contain NULLs, while you can have many unique constraints and each allows a single NULL in SQL Server. A primary key is usually clustered; both can be referenced by a foreign key.

Q: What is the difference between DELETE, TRUNCATE and DROP?

DELETE removes selected rows and is fully logged, so it can use a WHERE clause. TRUNCATE removes all rows quickly with minimal logging and resets IDENTITY, but takes no WHERE. DROP removes the whole table, both structure and data. In SQL Server all three can be rolled back inside a transaction.

Q: What is a NULL value and how is it different from zero or an empty string?

NULL marks a value that is unknown or missing, not the number zero and not an empty string, both of which are actual values. Because NULL means unknown, comparisons against it return UNKNOWN, so you test for it with IS NULL and IS NOT NULL rather than = NULL.

Q: What is the difference between UNION and UNION ALL?

Both stack the rows of two result sets that have the same number of columns and compatible types. UNION removes duplicate rows, which costs an extra sort or hash step, while UNION ALL keeps every row and is faster.

Q: What does the GROUP BY clause do?

GROUP BY collapses rows that share the same values in the grouping columns into a single row per group, so aggregate functions like COUNT, SUM, AVG, MIN and MAX are computed once per group instead of over the whole table.

Q: What is the difference between CHAR and VARCHAR?

CHAR(n) is a fixed-length type that always uses n characters and right-pads shorter values with spaces, while VARCHAR(n) is variable length and stores only the characters used plus a small length overhead. Use CHAR for values that are truly a fixed size and VARCHAR for anything that varies.

These are the foundational questions almost every SQL interview opens with. Each answer includes a clear diagram and runnable SQL so you can explain the concept and show the syntax with confidence.

Beginner 15 questions

1 What is a SQL JOIN and what are the main types?

A JOIN lets you query data that lives across more than one table by matching rows on a related column, usually a foreign key that points at a primary key. Instead of storing everything in one wide table, a relational database splits data into focused tables and stitches them back together at query time with a join.

The type of join controls which rows survive when there is no match on the other side. This is the single most important idea to get right in an interview, so it helps to picture each join as a Venn diagram of two sets, A (the left table) and B (the right table).

INNER JOIN keeps only rows that have a match in both tables.
LEFT OUTER JOIN keeps every row from the left table, filling NULLs where the right table has no match.
RIGHT OUTER JOIN is the mirror image: every row from the right table plus matches from the left.
FULL OUTER JOIN keeps every row from both tables, matched where possible.
CROSS JOIN pairs every row of A with every row of B (a Cartesian product) and has no ON clause.
A self join joins a table to an aliased copy of itself, which is how you model hierarchies like employee and manager.

In practice INNER and LEFT joins cover the large majority of real queries. Being able to explain, with a diagram, why a LEFT JOIN returns rows that an INNER JOIN drops is exactly what an interviewer is listening for.

INNER JOIN: customers that have placed at least one order

SELECT c.name, o.order_id, o.total
FROM Customers AS c
INNER JOIN Orders AS o
    ON o.customer_id = c.customer_id;

A customer with no orders is excluded, because there is no matching row in Orders.

LEFT JOIN: every customer, with their orders if any

SELECT c.name, o.order_id, o.total
FROM Customers AS c
LEFT JOIN Orders AS o
    ON o.customer_id = c.customer_id;

Customers with no orders still appear, with NULL in the order columns. Adding WHERE o.order_id IS NULL then finds customers who never ordered.

Self join: pair each employee with their manager

SELECT e.name AS employee, m.name AS manager
FROM Employees AS e
LEFT JOIN Employees AS m
    ON e.manager_id = m.emp_id;

The same table is aliased twice (e and m) so one row can reference another row in the same table.

Follow-up questions to expect

What is the difference between an INNER JOIN and a LEFT JOIN?
How would you find rows in one table that have no match in another?
What does a CROSS JOIN return, and when is it useful?
Is WHERE after a LEFT JOIN the same as putting the condition in the ON clause?

2 What is the difference between an INNER JOIN and a LEFT JOIN?

Both are types of SQL JOIN, and both match rows on a related column in the ON clause. The difference is what happens to a left-table row that has no match on the right side.

INNER JOIN keeps a row only when the ON condition finds a match in both tables. Unmatched rows on either side are dropped.
LEFT JOIN (short for LEFT OUTER JOIN) keeps every row from the left table. Where the right table has a match it fills in those columns; where it does not, it puts NULL in every right-side column.

Because a LEFT JOIN preserves unmatched left rows and marks them with NULLs, it powers the classic find rows with no match pattern (also called an anti join): LEFT JOIN the two tables, then WHERE right_table.key IS NULL to keep only the rows that failed to match. See customers with no orders for a full walkthrough.

One common trap: adding a filter on the right table in the WHERE clause can silently turn a LEFT JOIN back into an INNER JOIN. Rows that did not match have NULL in every right column, and a predicate like o.status = 'Shipped' is never true for NULL, so those unmatched rows get thrown away. If you need to filter the right table but still keep unmatched left rows, move the condition into the ON clause (or test for NULL explicitly).

Rule of thumb: reach for an INNER JOIN when you only care about records that exist on both sides, and a LEFT JOIN when the left table is the anchor you must not lose, such as listing all customers whether or not they have ordered.

INNER JOIN: only customers who have placed orders

SELECT c.name, o.order_id, o.total
FROM Customers AS c
INNER JOIN Orders AS o
    ON o.customer_id = c.customer_id;

A customer with zero orders never appears, because no matching row exists in Orders.

LEFT JOIN: every customer, orders filled in when they exist

SELECT c.name, o.order_id, o.total
FROM Customers AS c
LEFT JOIN Orders AS o
    ON o.customer_id = c.customer_id;

Customers with no orders still show up, with NULL in order_id and total. Add WHERE o.order_id IS NULL to keep only the customers who never ordered.

Follow-up questions to expect

How would you list customers who have never placed an order?
Why can a WHERE clause on the right table turn a LEFT JOIN into an INNER JOIN?
What is the difference between a LEFT JOIN and a RIGHT JOIN?
How does a FULL OUTER JOIN differ from a LEFT JOIN?

3 What is a primary key?

A primary key is the column, or combination of columns, that uniquely identifies each row in a table. It enforces entity integrity: every row can be found and referenced by exactly one key value, and no two rows share that value.

The primary key constraint guarantees two things at once. First, the values must be unique across all rows. Second, the columns must be NOT NULL, because a missing value could not reliably identify a row. A table can have only one primary key, though that key may span several columns.

In SQL Server, adding a PRIMARY KEY also creates a supporting index, which is a clustered index by default. This physically orders the table by the key and makes lookups on it fast.
A key built from one column is a single-column key; a key built from two or more columns is a composite (or compound) key, and the combination must be unique even if each column alone is not.
A surrogate key is a system-generated value with no business meaning, usually an IDENTITY or a GUID. A natural key uses real data such as an email or ISBN. Surrogate keys are common because they stay stable when business data changes.

The primary key is also the anchor that other tables point at: a foreign key references a primary key to link related rows. Choosing a good primary key early is one of the most consequential design decisions you make, because it drives both your relationships and your clustered storage order.

Single-column primary key with IDENTITY surrogate key

CREATE TABLE Customers (
    customer_id INT IDENTITY(1,1) PRIMARY KEY,
    email       VARCHAR(255) NOT NULL,
    name        VARCHAR(100) NOT NULL
);

customer_id is a surrogate key: SQL Server auto-generates a unique value for every new row.

Composite primary key across two columns

CREATE TABLE OrderItems (
    order_id   INT NOT NULL,
    product_id INT NOT NULL,
    quantity   INT NOT NULL,
    CONSTRAINT PK_OrderItems PRIMARY KEY (order_id, product_id)
);

The pair (order_id, product_id) must be unique; the same product can appear on many orders, but only once per order.

Follow-up questions to expect

What is the difference between a primary key and a unique key?
Why must primary key columns be NOT NULL?
Would you prefer a surrogate key or a natural key, and why?
What is the relationship between a primary key and a clustered index in SQL Server?

4 What is a foreign key and what does it enforce?

A foreign key is a column, or a group of columns, in one table that references a primary key or a UNIQUE key in another table. It can also point back into the same table, which is how you model a hierarchy such as an employee row that references its manager. The referencing table is the child and the referenced table is the parent.

Its job is to enforce referential integrity: every value in the foreign key column must exist as a key value in the parent, or be NULL. This gives you two guarantees. You cannot insert or update a child row that points at a parent that does not exist, and you cannot delete or change a parent row while child rows still reference it. Either violation is rejected by the engine.

In SQL Server a broken reference raises error 547, the REFERENCE / FOREIGN KEY constraint conflict. You control what happens to children when the parent changes using referential actions on the constraint:

NO ACTION (the default) blocks the delete or update and raises error 547 if children exist.
CASCADE propagates the change: ON DELETE CASCADE removes the matching child rows automatically, and ON UPDATE CASCADE rewrites the child key values.
SET NULL and SET DEFAULT clear or reset the child foreign key column instead of blocking or deleting.

Foreign keys are one of the core declarative SQL constraints. They are how a one-to-many relationship is expressed and protected, and the engine also uses them to reason about query plans. Use CASCADE deliberately, since a single delete can quietly remove a large tree of dependent rows.

Declare a foreign key in CREATE TABLE

CREATE TABLE Customers (
    customer_id INT PRIMARY KEY,
    name        NVARCHAR(100) NOT NULL
);

CREATE TABLE Orders (
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL,
    total       DECIMAL(10,2),
    CONSTRAINT FK_Orders_Customers
        FOREIGN KEY (customer_id)
        REFERENCES Customers (customer_id)
);

An Order can only reference a customer_id that already exists in Customers; anything else fails with error 547.

ON DELETE CASCADE removes child rows with the parent

ALTER TABLE Orders
ADD CONSTRAINT FK_Orders_Customers_Cascade
    FOREIGN KEY (customer_id)
    REFERENCES Customers (customer_id)
    ON DELETE CASCADE
    ON UPDATE NO ACTION;

-- Deleting a customer now also deletes that customer's orders
DELETE FROM Customers WHERE customer_id = 42;

With CASCADE the child orders are removed automatically; with the default NO ACTION the delete would be blocked while orders exist.

Follow-up questions to expect

What is the difference between ON DELETE CASCADE and ON DELETE NO ACTION?
Can a foreign key reference a column that is not a primary key?
Can a foreign key column contain NULL values, and what does that mean?
What error does SQL Server raise when a foreign key is violated?

5 What is the difference between a primary key and a unique key?

A primary key and a unique key (a UNIQUE constraint) both guarantee that the values in a column, or a set of columns, are unique across every row. The difference is about how many you can have, how they treat NULL, and the role each plays in the table design.

How many per table. A table can have exactly one primary key. You can define many unique constraints on the same table, one per set of columns that must stay distinct.
NULL handling. A primary key column is implicitly NOT NULL and rejects NULLs. A unique constraint allows a single NULL in SQL Server (because two NULLs are treated as equal for uniqueness), so at most one row can leave that column empty.
Physical storage. A primary key creates a clustered index by default, so it also sets the physical row order. A unique constraint creates a nonclustered index by default. Both defaults can be overridden.
Intent. The primary key is the row identity that other tables point at; a unique constraint protects an alternate key such as an email or SSN.

Both are valid constraints that a foreign key can reference. A child table can point its foreign key at a primary key or at any unique constraint on the parent, as long as the referenced columns are guaranteed unique. See also primary key vs foreign key for the relationship side of this.

When you insert or update a duplicate value, SQL Server raises one of two closely related errors. Violating a primary key or a UNIQUE constraint raises error 2627, while violating a plain unique index raises error 2601. Both mean the same thing at heart: the value you tried to write already exists.

PRIMARY KEY: one per table, NOT NULL, clustered by default

CREATE TABLE Customers (
    customer_id INT         NOT NULL,
    email       VARCHAR(255) NULL,
    CONSTRAINT PK_Customers PRIMARY KEY (customer_id)
);

customer_id is the row identity: unique, non-nullable, and (by default) the clustered index that orders the table.

UNIQUE constraint: many allowed, one NULL permitted

ALTER TABLE Customers
    ADD CONSTRAINT UQ_Customers_Email UNIQUE (email);

-- A second row with the same email fails with error 2627.
-- A single row may still leave email NULL; a second NULL would also fail.

email is an alternate key: it must be unique but is allowed to be NULL in exactly one row.

Follow-up questions to expect

Can a foreign key reference a unique constraint instead of a primary key?
Why can a primary key never contain a NULL?
What is the difference between a unique constraint and a unique index?
What does it mean that a primary key is clustered by default?

6 What is the difference between DELETE, TRUNCATE and DROP?

All three statements delete data, but they work at different levels and with very different costs. Interviewers ask this to check that you understand logging, identity behavior and the difference between a DML statement (that changes rows) and a DDL statement (that changes the schema itself).

DELETE is a DML statement that removes rows one at a time and writes each removed row to the transaction log. Because it is row based, it can take a WHERE clause to remove only some rows, it fires row DELETE triggers, and it leaves the table structure and any IDENTITY seed untouched.
TRUNCATE is a DDL statement that removes every row by deallocating the table pages instead of logging each row, so it is much faster and only minimally logged. It cannot use a WHERE clause, it resets the IDENTITY counter back to its seed, and it is blocked if another table has a foreign key that references this one.
DROP is a DDL statement that removes the entire table object, both the data and the definition (columns, indexes, constraints and permissions). After a DROP the table no longer exists at all.

A common myth is that TRUNCATE and DROP cannot be undone. In SQL Server, all three are transactional: if you wrap them in a BEGIN TRAN and issue ROLLBACK, the table and its rows come back. What differs is speed and side effects, not whether a transaction can protect them. See what a transaction is for more on that guarantee.

Rules of thumb: use DELETE when you need to remove a filtered set of rows or want triggers to fire; use TRUNCATE to empty a whole staging table fast while preserving its structure; use DROP only when you truly want the table gone. Note that TRUNCATE requires ALTER permission, so DELETE is sometimes the only option a low privileged account has. You can read the official rules in the TRUNCATE TABLE documentation.

DELETE selected rows, TRUNCATE all rows

-- DELETE: row based, filtered, fully logged, IDENTITY untouched
DELETE FROM dbo.Orders
WHERE order_date < '2023-01-01';

-- TRUNCATE: empties the whole table fast and resets IDENTITY to its seed
TRUNCATE TABLE dbo.StagingOrders;

DELETE keeps the current IDENTITY value; TRUNCATE resets it back to the original seed.

All three can be rolled back in a transaction

BEGIN TRAN;

    DROP TABLE dbo.TempReport;   -- table object is gone... for now

ROLLBACK;   -- the table and its data are fully restored

In SQL Server DROP, TRUNCATE and DELETE are all transactional, so ROLLBACK brings the object back.

Follow-up questions to expect

Why is TRUNCATE faster than a DELETE with no WHERE clause?
Why can you not TRUNCATE a table that is referenced by a foreign key?
What happens to the IDENTITY seed after a DELETE versus a TRUNCATE?
Do DELETE triggers fire when you TRUNCATE a table?

7 What is a NULL value and how is it different from zero or an empty string?

NULL is a special marker that means a value is unknown or missing, not that it is empty or zero. Zero (0) is a definite number and an empty string ('') is a definite string of length zero, but NULL is the absence of any value at all. Two rows that both hold NULL are not known to be equal, because you do not know what either value actually is.

This leads to three-valued logic. In SQL a comparison does not just return TRUE or FALSE; it can also return UNKNOWN. Any comparison with NULL, such as salary > 100 or a = b, yields UNKNOWN when a value is NULL. A WHERE clause keeps only rows that evaluate to TRUE, so rows that evaluate to UNKNOWN are silently dropped.

Because of this, column = NULL never matches anything, even for rows that are NULL. You must use the dedicated operators IS NULL and IS NOT NULL, which are the only correct way to test for the presence or absence of a value.

NULL also changes how aggregates behave. COUNT(*) counts every row, but COUNT(column) counts only rows where that column is not NULL. Functions like SUM, AVG and MAX ignore NULLs as well, so AVG(col) divides by the number of non-NULL values, not by the total row count.

To replace a NULL with a fallback value you use COALESCE, the ANSI standard function that returns the first non-NULL argument, or the SQL Server specific ISNULL. For a deeper walkthrough with more examples see SQL NULL values explained.

Test for missing values with IS NULL, not = NULL

-- Rows where phone was never recorded
SELECT customer_id, name
FROM Customers
WHERE phone IS NULL;

-- WRONG: this returns no rows, because = NULL is never TRUE
SELECT customer_id, name
FROM Customers
WHERE phone = NULL;

Only IS NULL matches missing values; a plain equality against NULL always evaluates to UNKNOWN and filters the row out.

Substitute a fallback with COALESCE

SELECT
    name,
    COALESCE(phone, 'no phone on file') AS phone_display,
    COALESCE(discount, 0) AS discount
FROM Customers;

COALESCE returns the first non-NULL argument, so unknown phone numbers and discounts show a readable default instead of NULL.

Follow-up questions to expect

What is the difference between COUNT(*) and COUNT(column)?
What does COALESCE return, and how does it differ from ISNULL in SQL Server?
Why does SELECT * FROM t WHERE col = NULL return no rows?
How do NULLs behave inside a GROUP BY or a UNIQUE constraint?

8 What is the difference between UNION and UNION ALL?

UNION and UNION ALL are set operators that stack one result set on top of another, so the combined output has all the rows of the first query followed by all the rows of the second. For this to work both queries must return the same number of columns in the same order, and each pair of columns must have compatible data types. The column names of the final result come from the first query.

The only functional difference is how they treat duplicates:

UNION removes duplicate rows, returning a set of distinct rows across both queries. To do this the engine must compare every row against the others, which it usually does with a sort or a hash aggregate, so there is extra work involved.
UNION ALL keeps every row, including duplicates. It simply concatenates the two inputs with no de-duplication step, so it is faster and uses less memory.

Because of that hidden distinct step, UNION behaves a lot like wrapping the combined output in SELECT DISTINCT. If you know duplicates are impossible (for example the two queries cover non-overlapping ranges) or you actually want the duplicates counted, prefer UNION ALL so you do not pay for a de-duplication you did not ask for.

A common interview trap: people reach for UNION by habit and quietly change the result by dropping legitimate repeated rows, or they slow a report down on large tables. Default to UNION ALL and only switch to UNION when you genuinely need distinct rows.

UNION: distinct list of cities across two tables

SELECT city FROM Customers
UNION
SELECT city FROM Suppliers;

A city that appears in both tables is returned only once, because UNION removes duplicate rows.

UNION ALL: keep every row, including duplicates

SELECT city FROM Customers
UNION ALL
SELECT city FROM Suppliers;

A city in both tables appears twice. No sort or hash step is needed, so this runs faster.

Follow-up questions to expect

Why can UNION be slower than UNION ALL on large tables?
What rules must the two SELECT statements satisfy for a UNION to be valid?
How do you control the ordering of a combined UNION result?
How is UNION different from JOIN?

9 What does the GROUP BY clause do?

GROUP BY partitions the rows returned by a query into groups that share the same values in the listed columns, and then produces exactly one output row per group. It is what turns a row-level query into a summary: instead of one row per order you get one row per customer, one row per month, or one row per status.

Grouping only becomes useful together with aggregate functions. An aggregate like COUNT, SUM, AVG, MIN or MAX takes all the rows inside a single group and reduces them to one value. Without GROUP BY an aggregate collapses the entire result set into a single row; with GROUP BY it is evaluated separately for each group.

Every column in the SELECT list that is not wrapped in an aggregate function must also appear in the GROUP BY clause. Otherwise SQL Server cannot decide which value from the group to show and raises an error.
To filter groups after aggregation, use HAVING, not WHERE. WHERE filters individual rows before they are grouped; HAVING filters the grouped result and can reference aggregates such as HAVING COUNT(*) > 1.
The logical order of evaluation is FROM, then WHERE, then GROUP BY, then HAVING, then SELECT, then ORDER BY, which is why an alias defined in SELECT is not visible to WHERE or GROUP BY.

A clean way to remember it: WHERE decides which rows go into the buckets, GROUP BY defines the buckets, the aggregate summarizes each bucket, and HAVING throws away whole buckets. For a deeper walkthrough of how these clauses fit together see ORDER BY, GROUP BY and HAVING explained.

GROUP BY with SUM: total sales per customer

SELECT customer_id, COUNT(*) AS order_count, SUM(total) AS total_spent
FROM Orders
GROUP BY customer_id
ORDER BY total_spent DESC;

Orders are grouped by customer_id, so COUNT and SUM are computed once per customer rather than across the whole table.

GROUP BY with HAVING: customers with more than one order

SELECT customer_id, COUNT(*) AS order_count
FROM Orders
WHERE status = 'Paid'
GROUP BY customer_id
HAVING COUNT(*) > 1;

WHERE removes unpaid orders before grouping; HAVING then keeps only the groups whose row count exceeds one.

Follow-up questions to expect

What is the difference between WHERE and HAVING?
Why must every non-aggregated SELECT column appear in GROUP BY?
How does COUNT(*) differ from COUNT(column_name) inside a group?
How would you use GROUP BY to find duplicate rows in a table?

10 What is the difference between CHAR and VARCHAR?

CHAR(n) and VARCHAR(n) both store character strings up to n characters, but they differ in how they use storage. CHAR(n) is fixed length: every value occupies exactly n bytes, and any value shorter than n is right-padded with spaces to fill the column. VARCHAR(n) is variable length: it stores only the characters you actually put in, plus a small 2-byte length prefix that records how long the value is.

The practical rule of thumb: use CHAR when the length is genuinely fixed and consistent, such as a two-letter country code, a fixed-width status flag, or an ISO currency code. For those cases the padding is not wasted and CHAR can be marginally faster because rows are a predictable size. Use VARCHAR for anything whose length varies, such as names, email addresses, or descriptions, where padding every value out to the maximum would waste a lot of space.

Storage: CHAR(10) holding cat uses 10 bytes; VARCHAR(10) holding cat uses 3 bytes plus 2 bytes of overhead.
Trailing spaces: CHAR pads with spaces on the way in, so values often need RTRIM() when you read or compare them. VARCHAR preserves exactly what you stored.
Unicode: use NCHAR and NVARCHAR to store Unicode (multilingual) text. In modern SQL Server NVARCHAR is the usual default for text columns.

Both types reject values longer than n. If you try to insert a string that is too long, SQL Server raises error 8152, "String or binary data would be truncated", rather than silently chopping the value. You can also read more in SQL data types explained. Note that VARCHAR(MAX) is a special form that stores very large strings (up to 2 GB) for cases where the length is unbounded.

How CHAR pads and VARCHAR does not

CREATE TABLE Demo (
    code_char    CHAR(5),
    code_varchar VARCHAR(5)
);

INSERT INTO Demo (code_char, code_varchar) VALUES ('cat', 'cat');

SELECT DATALENGTH(code_char)    AS char_bytes,    -- 5
       DATALENGTH(code_varchar) AS varchar_bytes  -- 3
FROM Demo;

DATALENGTH shows CHAR always uses the full 5 bytes (padded with spaces) while VARCHAR uses only the 3 bytes actually stored.

Fixed-length CHAR vs variable-length NVARCHAR in a table

CREATE TABLE Country (
    iso_code CHAR(2)        NOT NULL,  -- always 2 letters, e.g. US, GB
    name     NVARCHAR(100)  NOT NULL   -- varies, and may be Unicode
);

-- Trailing pad on CHAR means you often trim on read:
SELECT RTRIM(iso_code) AS iso_code, name
FROM Country;

CHAR(2) fits a fixed country code with no waste; NVARCHAR(100) suits variable-length Unicode names. RTRIM removes CHAR padding.

Follow-up questions to expect

When would you choose CHAR over VARCHAR in a real schema?
What is the difference between VARCHAR and NVARCHAR?
What happens if you insert a string longer than the declared length?
What is VARCHAR(MAX) and when would you use it?

11 What does DISTINCT do and how do you remove duplicate rows?

DISTINCT is a keyword you place right after SELECT. It tells SQL Server to collapse the result set so that each unique combination of the selected columns appears only once. The key point most candidates miss is that DISTINCT applies to the entire row of selected columns, not to a single column. SELECT DISTINCT city, country dedupes on the pair of values, so two rows with the same city but different country both survive.

You can achieve the same de-duplication with GROUP BY: SELECT city FROM Customers GROUP BY city returns the same distinct values. The difference is intent. Use DISTINCT when you just want unique rows, and use GROUP BY when you also want to compute an aggregate per group (such as COUNT or SUM). Reaching for DISTINCT to hide accidental duplicates produced by a bad JOIN is a common code smell; fix the join instead.

DISTINCT also works inside aggregates. COUNT(DISTINCT column) counts how many unique non-NULL values a column holds, which is different from COUNT(column) (all non-NULL values) and COUNT(*) (all rows). This is how you answer questions like how many distinct customers placed an order.

Remember that DISTINCT is purely a read-time operation. It affects the output of a SELECT and never touches the stored data. If the goal is to physically remove duplicate rows from a table, DISTINCT will not do it. See how to delete duplicate rows for the pattern that actually removes them, usually a CTE with ROW_NUMBER.

SELECT DISTINCT: list each city once

SELECT DISTINCT city
FROM Customers
ORDER BY city;

Even if 500 customers live in London, London appears one time. DISTINCT dedupes across all selected columns, so add country to keep same-named cities apart.

COUNT(DISTINCT): how many unique customers ordered

SELECT COUNT(DISTINCT customer_id) AS unique_customers,
       COUNT(*)                  AS total_orders
FROM Orders;

COUNT(DISTINCT customer_id) counts unique non-NULL customers, while COUNT(*) counts every order row, so the two numbers differ when customers order more than once.

Follow-up questions to expect

What is the difference between DISTINCT and GROUP BY?
How does COUNT(DISTINCT col) differ from COUNT(col) and COUNT(*)?
Does DISTINCT remove duplicates from the table or only from the result?
How would you actually delete duplicate rows from a table?

12 What are aggregate functions in SQL?

An aggregate function takes many rows as input and returns a single value that summarizes them. The five you must know are COUNT (how many rows), SUM (total of a numeric column), AVG (mean), MIN (smallest value) and MAX (largest value).

Without a GROUP BY, an aggregate runs over the entire result set and returns one row for the whole table. Add GROUP BY and the rows are split into groups first, so you get one summary row per group. This is why aggregates and grouping are almost always taught together.

A key detail interviewers probe is NULL handling: every aggregate except COUNT(*) ignores NULLs. So AVG(salary) divides the total by the count of non-NULL salaries, not by the total number of rows. If you want NULLs treated as zero, convert them first with something like AVG(ISNULL(salary, 0)).

COUNT(*) counts every row, including rows that are entirely or partly NULL.
COUNT(col) counts only rows where col is not NULL.
COUNT(DISTINCT col) counts the number of distinct non-NULL values in col.

Any non-aggregated column in the SELECT list must appear in the GROUP BY clause. To filter on the result of an aggregate you use HAVING, not WHERE, because WHERE is applied before rows are grouped. See WHERE vs HAVING for that distinction.

Aggregates over the whole table

SELECT COUNT(*)        AS total_rows,
       COUNT(email)    AS rows_with_email,
       AVG(salary)     AS avg_salary,
       MAX(salary)     AS top_salary
FROM Employees;

One summary row for the whole table. COUNT(email) skips NULL emails; AVG(salary) ignores NULL salaries.

Per-group results with GROUP BY and HAVING

SELECT department_id,
       COUNT(*)                 AS headcount,
       COUNT(DISTINCT job_title) AS distinct_roles,
       SUM(salary)              AS payroll
FROM Employees
GROUP BY department_id
HAVING COUNT(*) > 5
ORDER BY payroll DESC;

One row per department; HAVING filters on the aggregate to keep only departments with more than 5 people.

Follow-up questions to expect

What is the difference between COUNT(*), COUNT(column) and COUNT(DISTINCT column)?
Why do you use HAVING instead of WHERE to filter an aggregate?
How do aggregate functions treat NULL values?
Which columns are you allowed to put in the SELECT list alongside an aggregate?

13 In what order does SQL logically process the clauses of a SELECT?

You write a query in one order, but SQL Server logically processes it in another. The engine is free to optimise the physical plan however it likes, but the result must match this logical sequence of steps, where each step produces a virtual table that feeds the next:

FROM / JOIN - identify the source tables and combine them.
WHERE - filter individual rows (no aggregates or aliases yet).
GROUP BY - collapse rows into groups.
HAVING - filter the groups (this is where aggregate conditions go).
SELECT - evaluate the output expressions, including window functions, and assign column aliases.
DISTINCT - remove duplicate rows.
ORDER BY - sort the final result.
TOP / OFFSET-FETCH - limit how many rows are returned.

The key insight is that WHERE runs long before SELECT. When WHERE is evaluated the column aliases defined in the SELECT list do not exist yet, so referencing one raises error 207, "Invalid column name". The same is true for GROUP BY and HAVING, which also run before SELECT.

ORDER BY is the exception: it is the last logical step and runs after SELECT, so by then the aliases have been assigned and you can sort by them freely. This single rule explains most "why can I use my alias here but not there" questions.

This is also why an aggregate filter belongs in HAVING and not WHERE (see WHERE vs HAVING): the groups the aggregate summarises do not exist until GROUP BY has run, which is after WHERE.

Alias in WHERE fails (error 207)

SELECT unit_price * quantity AS line_total
FROM OrderLines
WHERE line_total > 100;   -- Msg 207: Invalid column name 'line_total'

WHERE runs before SELECT, so the alias does not exist yet. Repeat the full expression, or wrap the query in a CTE or subquery.

The same alias works in ORDER BY

SELECT unit_price * quantity AS line_total
FROM OrderLines
WHERE unit_price * quantity > 100
ORDER BY line_total DESC;

ORDER BY is the last logical step and runs after SELECT, so the alias is already defined and can be used to sort.

Follow-up questions to expect

Why must an aggregate filter go in HAVING instead of WHERE?
At what point are window functions evaluated, and can you filter on one in WHERE?
If aliases are not available in WHERE, how can you filter on a computed column?
Does the logical processing order match how the optimiser physically executes the query?

14 What is a self join and when would you use one?

A self join is not a special kind of join like INNER or LEFT; it is any join where a table is joined to itself. Because you cannot reference the same table name twice without confusion, a self join always needs two table aliases so the query engine can treat each copy as a separate logical table.

You reach for a self join whenever a row in a table is related to another row in the same table. The two classic cases are:

Hierarchies stored in one table. An Employees table where each row has a manager_id that points at the emp_id of another row in the same table. Joining the table to itself pairs each employee with their manager.
Comparing rows to other rows. Finding pairs of records that share a value (same city, same order date) or that differ in some way, by matching one alias against the other.

The choice of join type still matters. An INNER JOIN back to the same table drops any row that has no partner, so an employee with no manager (the CEO at the top of the tree) disappears. Use a LEFT JOIN when you want to keep the top of a hierarchy, filling NULL for the missing manager. To go beyond a single level (manager of a manager of a manager), you typically switch to a recursive CTE rather than stacking many self joins.

See the general join question for how self joins fit alongside the other join types, and note that self joins commonly appear in questions like finding employees who earn more than their manager.

Self join: pair each employee with their manager

SELECT e.name AS employee, m.name AS manager
FROM Employees AS e
LEFT JOIN Employees AS m
    ON e.manager_id = m.emp_id
ORDER BY manager, employee;

The table is aliased twice (e and m). LEFT JOIN keeps the top boss, whose manager_id is NULL, instead of dropping that row.

Self join: find pairs of employees in the same city

SELECT a.name AS employee_a, b.name AS employee_b, a.city
FROM Employees AS a
INNER JOIN Employees AS b
    ON a.city = b.city
   AND a.emp_id < b.emp_id
ORDER BY a.city;

Matching a.emp_id < b.emp_id lists each pair once and stops a row from pairing with itself.

Follow-up questions to expect

Why does a self join require table aliases?
When would you use a LEFT self join instead of an INNER self join?
How would you list managers of managers across many levels?
How do you stop a self join from matching a row with itself or listing a pair twice?

15 What is the difference between WHERE and HAVING?

Both WHERE and HAVING filter data, but they act at different stages of the query. WHERE runs first: it filters individual rows before any grouping happens, so it can only test columns and expressions on a single row. Because grouping has not occurred yet, WHERE cannot use aggregate functions such as COUNT(), SUM() or AVG().

HAVING runs later: it filters the groups produced by GROUP BY, after aggregation. That is why HAVING can reference aggregate functions, for example keeping only the groups where COUNT(*) > 5. If you try to put an aggregate in WHERE, SQL Server raises an error because the aggregate is not defined for a single row.

WHERE - filters rows before GROUP BY; no aggregates allowed.
HAVING - filters groups after GROUP BY; aggregates are allowed.
You often use both together: WHERE to discard unwanted rows early, then HAVING to discard unwanted groups.

This ordering is a direct consequence of how SQL logically processes a query: FROM, then WHERE, then GROUP BY, then HAVING, then SELECT. For a step by step walk through of that flow, see ORDER BY, GROUP BY and HAVING explained.

There is also a performance reason to filter early. Rows removed by WHERE never have to be grouped or aggregated, so put a condition in WHERE whenever it does not depend on an aggregate. Save HAVING for conditions that genuinely need the grouped result. A well placed WHERE also lets the optimizer use an index to seek instead of scanning the whole table.

WHERE and HAVING together

SELECT customer_id, COUNT(*) AS order_count, SUM(total) AS total_spent
FROM Orders
WHERE order_date >= '2025-01-01'
GROUP BY customer_id
HAVING COUNT(*) > 5;

WHERE keeps only 2025 orders before grouping; HAVING then keeps only customers with more than five of those orders.

Aggregate in WHERE fails; use HAVING instead

-- This errors: an aggregate is not allowed in WHERE
-- SELECT customer_id FROM Orders WHERE SUM(total) > 1000 GROUP BY customer_id;

-- Correct: filter the grouped result with HAVING
SELECT customer_id, SUM(total) AS total_spent
FROM Orders
GROUP BY customer_id
HAVING SUM(total) > 1000;

The condition depends on an aggregate, so it belongs in HAVING, which runs after GROUP BY.

Follow-up questions to expect

Can you use a column alias defined in SELECT inside HAVING?
In what logical order does SQL Server process WHERE, GROUP BY, HAVING and SELECT?
If a condition does not use an aggregate, should it go in WHERE or HAVING, and why?
Can HAVING be used without a GROUP BY clause?

Keep going

Explore the other interview categories, or drill the fundamentals with the functions library and the SQL Server error library.

Get 1-on-1 Prep All Categories