Home SQL Interview Questions Design
SQL Interview Prep

Database Design Interview Questions

Design questions test how you model real problems: relationships, keys, data types and ER diagrams. Interviewers want to see structured thinking. Each answer includes an ER-style diagram and the SQL that implements the design.

Design 15 questions
1 How do you model a one to many relationship?

A one to many relationship means one row in the parent (the "one" side) can be linked to many rows in the child (the "many" side), while each child row links back to exactly one parent. The classic example is Customers 1 - N Orders: a single customer can place many orders, but each order belongs to just one customer.

You model this by placing a foreign key on the many side. The child table (Orders) gets a customer_id column that references the primary key of the parent table (Customers). You never store the relationship on the one side, because a single column cannot hold a list of child rows; the pointer always lives on the child.

The foreign key does more than document intent, it is enforced. SQL Server rejects any insert or update that sets Orders.customer_id to a value that does not exist in Customers, so you can never end up with an orphaned order that points at a customer who was never created. Trying to insert such a row raises error 547, the foreign key conflict.

  • ON DELETE NO ACTION (the default) blocks deleting a customer while orders still reference it, protecting history.
  • ON DELETE CASCADE automatically deletes the child orders when the parent customer is removed.
  • ON DELETE SET NULL keeps the orders but sets their customer_id to NULL, which requires the column to be nullable.

In practice you almost always want the default NO ACTION for records like orders, so financial history is never silently destroyed. CASCADE fits owned detail rows (like order line items) where a child has no meaning without its parent.

Customerscustomer_id PKnameOrdersorder_id PKcustomer_id FK1manyOne row in Customers relates to many rows in Orders
Parent table: Customers (the one side)
CREATE TABLE Customers (
    customer_id INT IDENTITY(1,1) PRIMARY KEY,
    name        NVARCHAR(100) NOT NULL,
    email       NVARCHAR(255) NOT NULL
);

customer_id is the primary key that the child table will point back to.

Child table: Orders with a foreign key (the many side)
CREATE TABLE Orders (
    order_id    INT IDENTITY(1,1) PRIMARY KEY,
    customer_id INT NOT NULL,
    order_date  DATE NOT NULL DEFAULT SYSUTCDATETIME(),
    total       DECIMAL(10,2) NOT NULL,
    CONSTRAINT FK_Orders_Customers FOREIGN KEY (customer_id)
        REFERENCES Customers (customer_id)
        ON DELETE NO ACTION
);

customer_id is NOT NULL so every order must name a customer, and the FK guarantees that customer exists.

Follow-up questions to expect
  • Why does the foreign key go on the many side and not the one side?
  • What happens if you try to delete a customer that still has orders?
  • When would you choose ON DELETE CASCADE over NO ACTION?
  • How would you change this design to support a many to many relationship?

Related: What is a foreign key and what does it enforce? · How do you model a many to many relationship? · When and how do you model a one to one relationship?

2 How do you model a many to many relationship?

A many to many (M:N) relationship means a row on each side can relate to many rows on the other. A student can enroll in many courses, and a course can have many students. A relational table cannot express this directly: you would have to stuff a list of course ids into a single Students column, which breaks first normal form and makes joins, constraints, and indexing impossible.

The standard solution is to add a third table, called a junction table (also known as a bridge, link, or associative table). It sits between the two entities and holds one foreign key pointing at the primary key of each side. Every row in the junction represents a single pairing, such as one student in one course.

  • Two foreign keys: one referencing the parent on each side, for example student_id and course_id.
  • Composite primary key: the pair (student_id, course_id) is declared as the primary key, which guarantees the same student cannot be enrolled in the same course twice.
  • Two one to many links: the junction turns the single M:N relationship into two ordinary one to many relationships that a database can enforce with real constraints.

A key benefit is that the junction table is a natural home for attributes of the relationship itself. An enrollment is not just a link, it has its own facts: a grade and an enrolled_date belong on the junction row because they describe the pairing, not the student or the course alone.

So Students and Courses each stay lean, and an Enrollments table in the middle records who is in what, when they joined, and how they did. Modeling M:N this way is one of the most common data design tasks an interviewer will probe, so be ready to name the junction table, its two foreign keys, and its composite key.

Studentsstudent_id PKnameEnrollmentsstudent_id FKcourse_id FKCoursescourse_id PKtitleA junction table turns M:N into two 1:N relationships
The two entity tables: Students and Courses
CREATE TABLE Students (
    student_id  INT           PRIMARY KEY,
    full_name   NVARCHAR(100) NOT NULL
);

CREATE TABLE Courses (
    course_id   INT           PRIMARY KEY,
    title       NVARCHAR(100) NOT NULL
);

Each side is a plain table with its own primary key and no reference to the other.

The junction table with a composite key and extra columns
CREATE TABLE Enrollments (
    student_id    INT      NOT NULL,
    course_id     INT      NOT NULL,
    enrolled_date DATE     NOT NULL DEFAULT (GETDATE()),
    grade         CHAR(2)  NULL,
    CONSTRAINT PK_Enrollments PRIMARY KEY (student_id, course_id),
    CONSTRAINT FK_Enrollments_Student
        FOREIGN KEY (student_id) REFERENCES Students (student_id),
    CONSTRAINT FK_Enrollments_Course
        FOREIGN KEY (course_id)  REFERENCES Courses (course_id)
);

The composite primary key (student_id, course_id) blocks duplicate enrollments, and grade and enrolled_date describe the pairing itself.

Follow-up questions to expect
  • Why use a composite primary key on the junction instead of a separate surrogate id?
  • How would you query which courses a given student is enrolled in?
  • What happens to the design if an enrollment needs its own history or status over time?
  • How do the two foreign keys keep the junction rows from pointing at students or courses that do not exist?

Related: What is a junction table and when do you need one? · How do you model a one to many relationship? · What is a composite key and when should you use one?

3 When and how do you model a one to one relationship?

A one to one relationship means a row in one table matches at most one row in another, and vice versa. You model it by having both tables share the same key: the child table stores the parent key as its own primary key, and that same column is also a foreign key pointing back at the parent. Because the child key is unique (it is the PK) and references the parent, each parent row can have at most one child row.

The key design choice is the UNIQUE constraint. In a one to many relationship the foreign key column allows duplicates; in a one to one you enforce uniqueness by making that foreign key column the primary key of the child table. That single decision is what turns "many" into "one".

When is splitting worth it? The common reasons are:

  • Optional or rarely used columns. Attributes that are populated for only a small fraction of rows can live in a separate table so the main table stays narrow and avoids many NULLs.
  • Security sensitive columns. Isolating data like national ID numbers, salary, or password hashes into a child table lets you grant access to it separately with table level permissions.
  • Splitting a wide table (vertical partitioning). Moving large or infrequently read columns (long text, images, blobs) into a side table keeps the hot table smaller, so more rows fit per page and common scans do less I/O.

The alternative is almost always to keep everything in one table. A genuine one to one split adds a join to every query that needs both halves, so it is the less common relationship type. Reach for it only when a concrete driver above applies; if you cannot name one, prefer the single table. When you do split, define the child key as PRIMARY KEY and FOREIGN KEY together, and cascade deletes if the child cannot exist without its parent.

Usersuser_id PKemailUserProfilesuser_id PKbio11Each user has exactly one profile
Parent and child sharing one key
CREATE TABLE Users (
    user_id   INT           NOT NULL IDENTITY(1,1),
    email     VARCHAR(255)  NOT NULL,
    CONSTRAINT PK_Users PRIMARY KEY (user_id),
    CONSTRAINT UQ_Users_email UNIQUE (email)
);

CREATE TABLE UserProfiles (
    user_id     INT           NOT NULL,
    ssn         CHAR(11)      NULL,
    bio         NVARCHAR(MAX) NULL,
    -- user_id is BOTH the primary key and a foreign key,
    -- so each user has at most one profile row.
    CONSTRAINT PK_UserProfiles PRIMARY KEY (user_id),
    CONSTRAINT FK_UserProfiles_Users FOREIGN KEY (user_id)
        REFERENCES Users (user_id) ON DELETE CASCADE
);

The child PK is also the FK, so the shared user_id can appear at most once per user, which enforces one to one.

Reading both halves back with a join
SELECT u.user_id, u.email, p.ssn, p.bio
FROM Users AS u
LEFT JOIN UserProfiles AS p
    ON p.user_id = u.user_id;

A LEFT JOIN returns every user and fills NULLs when no profile row exists yet, which is common when the child columns are optional.

Follow-up questions to expect
  • How is this different from a one to many relationship at the schema level?
  • Why not just add these columns to the Users table instead?
  • How would you enforce that every user must have exactly one profile row?
  • What are the performance trade offs of splitting a wide table this way?

Related: How do you model a one to many relationship? · How do you choose the right data type for a column? · How do you choose a good primary key?

4 What is an entity relationship (ER) diagram?

An entity relationship (ER) diagram is a picture of a database design. It captures three things: entities (the things you store data about, which usually become tables), attributes (the properties of each entity, which become columns) and relationships (how entities connect to each other). Drawing this before writing any SQL lets a team agree on the structure and spot problems while they are still cheap to fix.

Each relationship carries a cardinality: how many rows on one side can relate to rows on the other. This is shown with crow's foot notation, where the symbols at the end of a line mean "one" (a single bar) or "many" (the splayed three-pronged foot). The common shapes are one-to-one, one-to-many and many-to-many.

An ER diagram maps directly onto a real relational schema:

  • An entity becomes a CREATE TABLE.
  • An attribute becomes a column, often with a type and constraints.
  • The entity identifier becomes a primary key.
  • A one-to-many relationship becomes a foreign key on the "many" side that points at the primary key on the "one" side.
  • A many-to-many relationship cannot be stored directly, so it becomes a third junction table holding a foreign key to each side.

Because it doubles as a communication tool and a build plan, the ER diagram is where design decisions like normalization are made visible. A well drawn diagram tells you exactly which tables, keys and foreign keys to create. You can read a fuller walkthrough in Entity Relationship Diagrams (ERD) Explained.

Customerscustomer_id PKnameOrdersorder_id PKcustomer_id FK1manyOne row in Customers relates to many rows in Orders
One-to-many: a Customer has many Orders
CREATE TABLE Customers (
    customer_id INT PRIMARY KEY,
    name        NVARCHAR(100) NOT NULL
);

CREATE TABLE Orders (
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL,
    total       DECIMAL(10,2) NOT NULL,
    CONSTRAINT FK_Orders_Customers
        FOREIGN KEY (customer_id) REFERENCES Customers(customer_id)
);

The foreign key lives on the many side (Orders) and points at the primary key on the one side (Customers).

Many-to-many: Students and Courses via a junction table
CREATE TABLE Students (
    student_id INT PRIMARY KEY,
    name       NVARCHAR(100) NOT NULL
);

CREATE TABLE Courses (
    course_id INT PRIMARY KEY,
    title     NVARCHAR(100) NOT NULL
);

CREATE TABLE Enrollments (
    student_id INT NOT NULL,
    course_id  INT NOT NULL,
    CONSTRAINT PK_Enrollments PRIMARY KEY (student_id, course_id),
    FOREIGN KEY (student_id) REFERENCES Students(student_id),
    FOREIGN KEY (course_id)  REFERENCES Courses(course_id)
);

A many-to-many relationship becomes a third table whose composite key holds a foreign key to each side.

Follow-up questions to expect
  • How does a many-to-many relationship become tables in a real schema?
  • What is the difference between logical and physical ER models?
  • How do you show optional versus mandatory participation in crow's foot notation?
  • What does an ER diagram look like after normalizing to third normal form?

Related: How do you model a one to many relationship? · How do you model a many to many relationship? · What is normalization and why do we do it?

5 What is the difference between a surrogate key and a natural key?

Both are candidates for a table's primary key, but they come from opposite places. A natural key is a column (or set of columns) drawn from the real world that already uniquely identifies a row, for example an email address, an ISBN, a country code or a Social Security number. A surrogate key is an artificial value the system generates purely to identify the row, such as an IDENTITY column, a SEQUENCE value or a GUID. It carries no business meaning at all.

The core trade-off is meaning versus stability. A natural key is readable and self-explanatory, and it enforces a genuine business rule for free. Its weaknesses are that business data can change (people change email or surname, a product gets re-coded) and it is often wide or composite, which makes it a bulky value to copy into every foreign key and index.

A surrogate key is stable because nothing in the business ever forces it to change, and it is narrow and fast: a 4-byte INT makes a compact clustered index and a cheap foreign key. Its downside is that it is meaningless, so a join is always required to make sense of the row, and by itself it does not stop duplicate business data.

  • Natural key: meaningful, no extra column, but can change and may be wide or composite.
  • Surrogate key: stable, narrow, fast to join and index, but meaningless and needs an extra column.

The common professional pattern is to use both: a surrogate key as the primary key that foreign keys reference, plus a separate UNIQUE constraint on the natural key so the database still rejects duplicate real-world values. Skipping that unique constraint is the classic mistake, because a surrogate key alone lets you insert the same customer twice.

Primary Keyuniquely IDs a rowUnique Keyno duplicate valuesForeign Keypoints to another PKComposite Keykey over 2+ columnsDifferent constraints enforce different guarantees
Surrogate key: system-generated IDENTITY as the primary key
CREATE TABLE Customers (
    CustomerId   INT IDENTITY(1,1) PRIMARY KEY,
    Email        VARCHAR(255) NOT NULL,
    FullName     VARCHAR(120) NOT NULL,
    CONSTRAINT UQ_Customers_Email UNIQUE (Email)
);

CustomerId is meaningless and stable, so foreign keys point at it; the UNIQUE constraint on Email still blocks duplicate customers.

Natural key: the business value itself is the primary key
CREATE TABLE Countries (
    CountryCode  CHAR(2) PRIMARY KEY,   -- ISO code, e.g. 'US', 'GB'
    CountryName  VARCHAR(80) NOT NULL
);

A small, truly stable natural key like an ISO country code is a good case for skipping a surrogate.

Follow-up questions to expect
  • When would you still prefer a natural key over a surrogate key?
  • Why should you add a UNIQUE constraint on the natural key when using a surrogate?
  • What are the trade-offs of using a GUID versus an IDENTITY column for a surrogate key?
  • How do surrogate keys affect foreign keys and index size in a large schema?

Related: How do you choose a good primary key? · What is the difference between a candidate key and a primary key? · What is the difference between a primary key and a unique key?

6 How do you choose a good primary key?

A primary key uniquely identifies every row in a table and is referenced by foreign keys in other tables, so the choice ripples across the whole schema. Judge every candidate against four hard criteria and one strong preference.

  • Unique: the value must identify exactly one row. The database enforces this for you.
  • Never NULL: a primary key column cannot be NULL, because a missing value cannot identify a row.
  • Stable: the value should never change once assigned. Updating a key forces cascading updates to every foreign key that references it.
  • Narrow: prefer a small data type such as INT or BIGINT. The key is copied into every foreign key and every nonclustered index, so a wide key inflates storage and slows joins and lookups.
  • Preferably meaningless: a value with no business meaning never needs to change when the business does.

Those criteria push you toward a surrogate key: a system generated, meaningless integer produced by an IDENTITY column or a SEQUENCE. It is narrow, guaranteed unique, and never has a business reason to change. A natural key uses real business data such as an email address or product code. It can work when a genuinely stable and compact column already exists, but business values tend to change and are often wider than an integer, which is why surrogate keys are the common default.

A GUID (UNIQUEIDENTIFIER) is another surrogate option. Its advantage is that it is globally unique, so you can generate keys on many clients or servers without collisions, which suits distributed systems and merge replication. The trade offs are that it is wide (16 bytes versus 4 for an INT) and, when generated randomly with NEWID(), its values arrive out of order and cause heavy fragmentation of a clustered index and frequent page splits. If a GUID must be the clustered key, use sequential values from NEWSEQUENTIALID() so inserts append in order.

Remember that the primary key and the clustered index are related but separate choices. By default SQL Server makes the primary key the clustered index, and the clustering key is stored in every nonclustered index, so a narrow ever increasing key keeps those indexes small and inserts cheap. See SQL indexes for how the clustered index shapes physical row order, and primary key vs foreign key for how keys tie tables together.

Departmentsdept_id PKdept_nameEmployeesemp_id PKdept_id FKEmployees.dept_id (FK) references Departments.dept_id (PK)
Surrogate key with an INT IDENTITY column
CREATE TABLE Customers (
    customer_id INT IDENTITY(1,1) NOT NULL,
    email       NVARCHAR(256) NOT NULL,
    full_name   NVARCHAR(100) NOT NULL,
    CONSTRAINT PK_Customers PRIMARY KEY (customer_id),
    CONSTRAINT UQ_Customers_email UNIQUE (email)
);

The narrow, meaningless customer_id is the primary key; the natural email value gets a UNIQUE constraint instead of being the key so it can change without breaking foreign keys.

GUID key: use a sequential value when it is clustered
CREATE TABLE Devices (
    device_id UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_Devices_id DEFAULT NEWSEQUENTIALID(),
    serial_no NVARCHAR(64) NOT NULL,
    CONSTRAINT PK_Devices PRIMARY KEY CLUSTERED (device_id)
);

NEWSEQUENTIALID() hands out increasing GUIDs so inserts append at the end of the clustered index, avoiding the page splits and fragmentation a random NEWID() would cause.

Follow-up questions to expect
  • When would you choose a natural key over a surrogate key?
  • Why can a random GUID hurt performance as a clustered primary key?
  • Is the primary key always the clustered index in SQL Server?
  • How does the width of a primary key affect nonclustered indexes?

Related: What is the difference between a surrogate key and a natural key? · What is a primary key? · What is a composite key and when should you use one?

7 What is a composite key and when should you use one?

A composite key (also called a compound key) is a primary key or unique key built from two or more columns. No single column on its own is unique, but the combination of columns is guaranteed unique, so together they identify each row. You declare it by listing the columns in one PRIMARY KEY or UNIQUE constraint rather than marking a single column.

The classic fit is a junction table that resolves a many-to-many relationship. In an Enrollments table, a student can take many courses and a course has many students, so neither student_id nor course_id is unique by itself. The pair (student_id, course_id) is the natural identity of an enrollment, and making it the primary key also stops the same student being enrolled in the same course twice.

Column order matters. A composite PRIMARY KEY is backed by an index, and that index is ordered by the first column, then the second, and so on (like the alphabetical order of a phone book by last name then first name). A query that filters on student_id can seek that index, but a query that filters only on course_id cannot, because the second column is not usefully sorted on its own. Put the column you filter by most, or the more selective one, first, and add a separate index for the other access pattern if you need it.

  • Use a composite key when the row has a natural multi-column identity, above all in junction tables, or when a business rule requires a combination to be unique.
  • Prefer a single surrogate key (one auto-generated INT IDENTITY or GUID) when the natural key is wide, changes over time, or is referenced by many other tables. A slim single-column key makes foreign keys, joins and application code simpler, and you can still add a UNIQUE constraint on the natural columns to keep the rule enforced.

In short, reach for a composite key when a combination of columns is the true identity of the row; reach for a surrogate key when you want a small, stable handle that is easy for the rest of the schema to point at.

Primary Keyuniquely IDs a rowUnique Keyno duplicate valuesForeign Keypoints to another PKComposite Keykey over 2+ columnsDifferent constraints enforce different guarantees
Composite PRIMARY KEY on a junction table
CREATE TABLE Enrollments (
    student_id INT NOT NULL,
    course_id  INT NOT NULL,
    enrolled_on DATE NOT NULL DEFAULT SYSUTCDATETIME(),
    CONSTRAINT PK_Enrollments PRIMARY KEY (student_id, course_id),
    CONSTRAINT FK_Enroll_Student FOREIGN KEY (student_id) REFERENCES Students(student_id),
    CONSTRAINT FK_Enroll_Course  FOREIGN KEY (course_id)  REFERENCES Courses(course_id)
);

The pair (student_id, course_id) is unique, so a student cannot enrol in the same course twice. The key columns are also the two foreign keys.

Column order drives which lookups can seek the index
-- Seeks the PK index (leads with student_id):
SELECT course_id FROM Enrollments WHERE student_id = 42;

-- Cannot seek the PK index efficiently (course_id is the 2nd column):
SELECT student_id FROM Enrollments WHERE course_id = 7;

-- Add a supporting index for the reverse access pattern:
CREATE INDEX IX_Enrollments_Course ON Enrollments (course_id, student_id);

The composite PK is ordered by student_id first, so filtering only on course_id needs its own index.

Follow-up questions to expect
  • Why does the column order in a composite primary key affect query performance?
  • When would you replace a composite natural key with a single surrogate key?
  • How does a foreign key reference a table that has a composite primary key?
  • What is the difference between a composite key and a compound index?

Related: How do you model a many to many relationship? · Why does the column order in a composite index matter? · How do you choose a good primary key?

8 How do you model hierarchical data such as an org chart?

An org chart is a tree: each employee reports to exactly one manager, and a manager can have many reports. The default and most common way to store this is the adjacency list, where the table holds a parent_id column that is a foreign key pointing back at the same table primary key. One extra column captures the entire hierarchy, no matter how deep.

The adjacency list is trivial to write to: inserting an employee or moving them under a new manager is a single-row change. The trade-off is reads. To answer "who are all the reports beneath this manager?" you must walk an unknown number of levels, which is where a recursive CTE comes in. A recursive CTE has an anchor member (the starting row) and a recursive member that repeatedly joins back to the CTE until no more rows are found.

When you point a foreign key at its own table you should think carefully about the ON DELETE rule. SQL Server refuses a self-referencing ON DELETE CASCADE because it could form a cycle, raising error 1785. Leave the action as NO ACTION and reparent or delete children explicitly.

Other models exist and are worth naming to show breadth:

  • Path enumeration stores the full ancestry as a string such as /1/4/9/. Subtree reads become a fast LIKE prefix scan, but the path must be rewritten whenever a node moves and referential integrity is weaker.
  • Nested sets give every node a left and right number so a subtree is a single range query. Reads are very fast, but almost any insert renumbers large parts of the table, so writes are expensive.
  • SQL Server hierarchyid is a built-in CLR data type that encodes the path compactly and ships with methods like GetAncestor() and IsDescendantOf(). It is convenient but SQL Server specific.

For most applications the guidance is simple: start with an adjacency list and a recursive CTE, and only reach for nested sets or hierarchyid when read patterns on very deep or very wide trees prove it necessary.

Employees eemp_id PKnamemanager_id FKEmployees memp_id PKnamemanager_id FKe.manager_id = m.emp_idOne table joined to an aliased copy of itself
Adjacency list: a self-referencing table
CREATE TABLE Employees (
    emp_id     INT          NOT NULL PRIMARY KEY,
    name       NVARCHAR(100) NOT NULL,
    manager_id INT          NULL,
    CONSTRAINT FK_Employees_Manager
        FOREIGN KEY (manager_id) REFERENCES Employees (emp_id)
);

manager_id points back at emp_id in the same table. The CEO row has manager_id = NULL. Do not use ON DELETE CASCADE here (error 1785).

Recursive CTE: walk the tree under one manager
WITH OrgChart AS (
    -- anchor: the starting manager
    SELECT emp_id, name, manager_id, 0 AS depth
    FROM Employees
    WHERE emp_id = 1

    UNION ALL

    -- recursive: everyone who reports to a row already found
    SELECT e.emp_id, e.name, e.manager_id, oc.depth + 1
    FROM Employees AS e
    INNER JOIN OrgChart AS oc
        ON e.manager_id = oc.emp_id
)
SELECT emp_id, name, depth
FROM OrgChart
ORDER BY depth, name;

The depth column shows how many levels below the starting manager each employee sits. Add OPTION (MAXRECURSION 100) to cap runaway recursion.

Follow-up questions to expect
  • Write a recursive CTE that returns the full management chain above a given employee.
  • Why does SQL Server reject a self-referencing ON DELETE CASCADE?
  • When would nested sets or hierarchyid beat a plain adjacency list?
  • How would you prevent or detect a cycle where an employee ends up as their own manager?

Related: What is a recursive CTE and when do you need one? · What is a self join and when would you use one? · How do you model a one to many relationship?

9 What is the difference between a star schema and a snowflake schema?

A star schema and a snowflake schema are both dimensional models used in data warehousing and OLAP (analytical) workloads. Each is organized around a central fact table that stores measurable events (sales, clicks, shipments) as numeric measures plus foreign keys, surrounded by dimension tables that describe those facts (product, customer, date, store). The difference is purely in how the dimension tables themselves are structured.

In a star schema each dimension is a single, denormalized table. A Product dimension holds the product name, category, and brand all in one row, even though category and brand repeat across many products. The layout looks like a star: the fact table in the middle with dimensions radiating out one join away. Because any query touches only the fact table and one level of dimensions, joins are few and the query optimizer has an easy job, so star schemas are fast to read and simple to understand.

A snowflake schema takes those dimensions and normalizes them into related sub-tables. The Product dimension is split so that category and brand live in their own lookup tables, referenced by foreign keys. This removes repeated text and saves storage, but a query now has to traverse several joins (fact to product, product to category, category to department) to get the same descriptive columns, which makes queries more complex and usually slower.

  • Star: denormalized dimensions, fewer joins, simpler queries, faster reads, some data redundancy.
  • Snowflake: normalized dimensions in sub-tables, more joins, less redundancy, easier to maintain shared attributes, slower analytical queries.

The choice reflects a broader split between OLTP and OLAP thinking. Transactional (OLTP) systems favor heavy normalization to avoid update anomalies when data changes constantly. Analytical (OLAP) warehouses are read-mostly and load data in controlled batches, so the classic advice is to denormalize for query speed and prefer a star schema, reaching for snowflaking only when a dimension is very large or its shared attributes genuinely need to be managed in one place. See normalization vs denormalization for the underlying trade-off.

Fact: Salesmeasures + FKsDim: DateDim: ProductDim: CustomerDim: StoreOne fact table surrounded by dimension tables
Star schema: one join per dimension
SELECT p.category,
       d.calendar_year,
       SUM(f.sales_amount) AS total_sales
FROM FactSales AS f
INNER JOIN DimProduct AS p
    ON p.product_key = f.product_key
INNER JOIN DimDate AS d
    ON d.date_key = f.date_key
GROUP BY p.category, d.calendar_year;

category lives directly on DimProduct, so the fact table joins each dimension exactly once.

Snowflake schema: the same result needs an extra join
SELECT c.category_name,
       d.calendar_year,
       SUM(f.sales_amount) AS total_sales
FROM FactSales AS f
INNER JOIN DimProduct AS p
    ON p.product_key = f.product_key
INNER JOIN DimCategory AS c
    ON c.category_key = p.category_key
INNER JOIN DimDate AS d
    ON d.date_key = f.date_key
GROUP BY c.category_name, d.calendar_year;

category has been normalized into DimCategory, so reaching category_name requires a second hop through DimProduct.

Follow-up questions to expect
  • When would you choose to snowflake a dimension instead of keeping it flat?
  • How does a star schema help query performance compared to a fully normalized OLTP model?
  • What is a slowly changing dimension, and how do you handle one?
  • What is the grain of a fact table and why does defining it matter?

Related: When should a schema be normalized versus denormalized? · When is it acceptable to denormalize? · How do you model a one to many relationship?

10 How do you choose the right data type for a column?

Choosing a data type is a design decision, not a formality. The right type enforces correctness (you cannot store a name in an INT), and it controls how much space each row uses on disk and in memory. The guiding rule is simple: pick the smallest type that safely fits every value the column will ever hold. Narrower columns produce smaller rows, smaller indexes and less I/O, so more data fits on each page and queries run faster.

  • Whole numbers use INT (about +/-2.1 billion) or BIGINT when you will exceed that. Use TINYINT or SMALLINT for small bounded ranges.
  • Money and exact decimals use DECIMAL(p, s) (also spelled NUMERIC). Never use FLOAT or REAL for money: they are binary floating point and cannot represent values like 0.10 exactly, so sums drift by rounding error.
  • Text uses VARCHAR(n) for ASCII and NVARCHAR(n) when the column must hold Unicode (names, emoji, non-Latin scripts). Size n to the real maximum; avoid a blanket VARCHAR(MAX) or huge widths you never use.
  • True or false uses BIT, which SQL Server can pack several of into a single byte.
  • Dates and times use DATE for a calendar day and DATETIME2 for a timestamp. Prefer DATETIME2 over the legacy DATETIME: it has a wider range, higher precision and often uses fewer bytes.

Two choices deserve extra care. First, use DECIMAL, not FLOAT, for anything you add up or compare for equality, especially currency. Reserve FLOAT for scientific or approximate measurements where tiny rounding is acceptable. Second, use DATETIME2 instead of DATETIME in new designs; keep DATETIME only for compatibility with old columns.

Watch out for oversizing. Declaring VARCHAR(4000) for a column that never exceeds 50 characters wastes nothing on disk (varchar stores only what you put in it), but it inflates memory grants and the optimizer's size estimates, and it can push a row or index past size limits. Right-size the declared length to the data.

Finally, keep types consistent across a foreign key and its primary key. If Orders.customer_id is INT but Customers.customer_id is BIGINT, every join forces an implicit conversion, which can block index seeks and cause errors such as conversion error 245 or arithmetic overflow 8114. Match the type and length on both sides. For a fuller tour of the type system see SQL data types explained.

The dataGood typewhole numbers->INT / BIGINTmoney->DECIMALtext->VARCHAR / NVARCHARtrue or false->BITdate and time->DATE / DATETIME2Smallest type that safely fits = smaller rows and faster indexes.
Right-sized types for a typical table
CREATE TABLE dbo.Products (
    product_id  INT           NOT NULL PRIMARY KEY,
    sku         VARCHAR(20)   NOT NULL,
    name        NVARCHAR(100) NOT NULL,
    price       DECIMAL(10,2) NOT NULL,
    is_active   BIT           NOT NULL DEFAULT 1,
    created_at  DATETIME2(0)  NOT NULL DEFAULT SYSUTCDATETIME()
);

DECIMAL keeps money exact, NVARCHAR holds Unicode names, BIT stores the flag, and DATETIME2 timestamps the row.

Mismatched key types force an implicit conversion
-- Customers.customer_id is BIGINT but Orders.customer_id is INT
SELECT c.name, o.order_id
FROM Customers AS c
JOIN Orders    AS o
    ON o.customer_id = c.customer_id;
-- SQL Server must CONVERT one side per row, which can
-- prevent an index seek and risk error 245 / overflow 8114.

Declare a foreign key with the same type and length as the primary key it references to avoid this.

Follow-up questions to expect
  • Why is FLOAT a bad choice for storing currency?
  • What is the difference between VARCHAR and NVARCHAR, and when do you need NVARCHAR?
  • Why is DATETIME2 preferred over DATETIME in new designs?
  • How can mismatched data types on a join column hurt query performance?

Related: What is the difference between CHAR and VARCHAR? · How do you choose a good primary key? · When and how do you model a one to one relationship?

11 What is a junction table and when do you need one?

A junction table - also called a bridge or associative table - is the standard way to model a many to many relationship in a relational database. You need one whenever a row on each side can relate to many rows on the other side: a student takes many courses and a course has many students, or an order contains many products and a product appears in many orders.

A relational schema cannot express many to many directly, because a single foreign key column can only point at one parent row. The fix is to introduce a third table that sits between the two entities and resolves the many to many into two one to many relationships. Each parent has many junction rows, and each junction row points back to exactly one parent on each side.

  • The junction table holds a foreign key to each side (for example student_id and course_id), each referencing the primary key of its parent table.
  • Its primary key is usually the composite of those two foreign keys. A composite key on (student_id, course_id) guarantees a given pair can only be stored once, so a student cannot be enrolled in the same course twice.
  • The table can carry relationship attributes - data that belongs to the pairing rather than to either entity, such as an enrollment date, a grade, or a quantity on an order line.

Enforce referential integrity with FOREIGN KEY constraints on both columns and, ideally, an ON DELETE CASCADE so removing a parent cleans up its junction rows. If the pairing needs to be referenced by yet another table, or the natural key is wide, teams sometimes add a surrogate IDENTITY key and enforce the pair with a separate UNIQUE constraint instead.

Junction tables are a normal part of a normalized design. If you ever find yourself storing a comma separated list of ids in a single column, that is the signal you actually needed a junction table.

Studentsstudent_id PKnameEnrollmentsstudent_id FKcourse_id FKCoursescourse_id PKtitleA junction table turns M:N into two 1:N relationships
Junction table with a composite primary key
CREATE TABLE Enrollments (
    student_id INT NOT NULL,
    course_id  INT NOT NULL,
    CONSTRAINT PK_Enrollments PRIMARY KEY (student_id, course_id),
    CONSTRAINT FK_Enroll_Student FOREIGN KEY (student_id)
        REFERENCES Students (student_id),
    CONSTRAINT FK_Enroll_Course FOREIGN KEY (course_id)
        REFERENCES Courses (course_id)
);

The composite PRIMARY KEY on both foreign keys prevents the same student-course pair being inserted twice.

Junction table that also stores a relationship attribute
CREATE TABLE OrderLines (
    order_id   INT NOT NULL,
    product_id INT NOT NULL,
    quantity   INT NOT NULL DEFAULT (1),
    CONSTRAINT PK_OrderLines PRIMARY KEY (order_id, product_id),
    CONSTRAINT FK_Line_Order FOREIGN KEY (order_id)
        REFERENCES Orders (order_id) ON DELETE CASCADE,
    CONSTRAINT FK_Line_Product FOREIGN KEY (product_id)
        REFERENCES Products (product_id)
);

quantity belongs to the pairing, not to either parent, so it lives on the junction row alongside the two foreign keys.

Follow-up questions to expect
  • Why not just store a comma separated list of course ids in the Students table?
  • When would you add a surrogate key to a junction table instead of using the composite key?
  • How do you query which students are enrolled in a given course using the junction table?
  • What does ON DELETE CASCADE do on the junction table foreign keys?

Related: How do you model a many to many relationship? · What is a composite key and when should you use one? · How do you model a one to many relationship?

12 What is the difference between a soft delete and a hard delete?

Both approaches answer the question "how do we get rid of a record?", but they trade permanence for recoverability in opposite ways. A hard delete issues a real DELETE and the row physically leaves the table. A soft delete leaves the row in place and instead sets a marker column that means "treat this as gone", so every read has to filter the deleted rows out.

The marker is usually one of two shapes. An is_deleted bit (or status value) is the simplest: it answers only yes or no. A deleted_at datetime is richer because a non-NULL value both flags the row and records when it happened, which pairs well with audit history. Many teams add deleted_by alongside it.

Why choose soft delete? It preserves history for audits and undo, keeps referential integrity intact because child rows still point at a parent that exists, and lets you answer "what did this look like last month" questions. It is the natural fit when data is legally or operationally valuable.

  • Every query must filter. You have to add WHERE is_deleted = 0 (or deleted_at IS NULL) everywhere, and one forgotten filter leaks deleted data. A filtered view or row-level security helps.
  • Tables bloat. Rows never leave, so indexes grow and scans do more work over time; archiving old soft-deleted rows becomes its own job.
  • Unique constraints break. A normal UNIQUE on email blocks a user from re-registering an address that belongs to a soft-deleted row. The fix is a filtered unique index that only covers live rows.

When you still need a hard delete: compliance rules such as GDPR right to erasure require that personal data actually leaves the system, so a flag is not enough. A common pattern is to soft delete for day-to-day use and then run a real DELETE (or anonymization) job to purge data once a retention window closes. See also DELETE vs TRUNCATE vs DROP for how the removal itself behaves.

Soft deleteid 41 Ana liveid 42 Ben deletedid 43 Cara liverow stays, is_deleted = 1Hard deleteid 41 Ana live(row 42 removed)id 43 Cara liveDELETE removes it for good
Soft delete: mark the row and keep uniqueness on live rows only
-- Mark the row deleted instead of removing it
UPDATE Users
SET is_deleted = 1,
    deleted_at = SYSUTCDATETIME(),
    deleted_by = @current_user_id
WHERE user_id = 42;

-- Normal reads must exclude soft-deleted rows
SELECT user_id, email
FROM Users
WHERE is_deleted = 0;

-- Filtered unique index: email is unique among LIVE rows only,
-- so a soft-deleted email can be re-used by a new signup.
CREATE UNIQUE INDEX UX_Users_Email_Live
    ON Users (email)
    WHERE is_deleted = 0;

The row survives for audit and restore, and the filtered index keeps the unique rule without blocking re-registration of a deleted address.

Hard delete: remove the row permanently
-- The row physically leaves the table and cannot be filtered back in
DELETE FROM Users
WHERE user_id = 42;

Use this for a genuine purge, for example honoring a GDPR erasure request after any retention window has passed.

Follow-up questions to expect
  • How do you stop a UNIQUE constraint from blocking re-use of a soft-deleted value?
  • How would you make sure application queries never forget the is_deleted filter?
  • How does soft delete interact with foreign keys and cascading deletes?
  • How do you satisfy a GDPR erasure request if you soft delete by default?

Related: How do you model audit history or change tracking? · How do you enforce referential integrity? · What is the difference between DELETE, TRUNCATE and DROP?

13 How do you enforce referential integrity?

Referential integrity means every value in a child column that references another table always points at a row that actually exists (or is NULL). You enforce it declaratively with a FOREIGN KEY constraint: it links a child column to the primary key (or a unique key) of the parent table, and the engine refuses any INSERT or UPDATE that would leave a child pointing at a parent that is not there.

The constraint also decides what happens to child rows when the parent row is deleted or its key is updated. You pick a referential action per direction with ON DELETE and ON UPDATE:

  • NO ACTION (the default) blocks the delete or update if any child still references the parent, and raises an error.
  • CASCADE propagates the change: deleting a parent deletes its children, updating the parent key rewrites the child keys.
  • SET NULL sets the child foreign key column to NULL (the column must be nullable).
  • SET DEFAULT sets the child column to its DEFAULT value, which must itself reference an existing parent.

Always enforce this in the database, not just in application code. The database is the one gatekeeper every client, ETL job, ad hoc script and future service must pass through, so a constraint is the only rule that cannot be bypassed by a bug, a race condition, or a direct SQL edit. Application checks are best effort duplicates of a guarantee the engine can give you for free.

When a write breaks the rule SQL Server raises error 547, the constraint conflict. If you build cascading chains that could reach the same table by more than one path, defining the constraint fails with error 1785 (multiple cascade paths); the fix is to make some paths NO ACTION and handle them in a trigger or in code. See SQL constraints explained for how FKs sit alongside CHECK, UNIQUE and DEFAULT.

Finally, understand the cost of not trusting your constraints. Loading data with WITH NOCHECK (or adding a constraint with WITH NOCHECK) leaves the constraint not trusted: existing rows are never validated, and the query optimizer can no longer assume the relationship holds, so it skips optimizations that a trusted constraint would allow. Re-validate with WITH CHECK CHECK CONSTRAINT so the constraint becomes trusted again.

Departmentsdept_id PKdept_nameEmployeesemp_id PKdept_id FKEmployees.dept_id (FK) references Departments.dept_id (PK)
FOREIGN KEY with ON DELETE CASCADE
CREATE TABLE Orders (
    order_id   INT PRIMARY KEY,
    customer_id INT NOT NULL,
    total      DECIMAL(10,2) NOT NULL,
    CONSTRAINT FK_Orders_Customers
        FOREIGN KEY (customer_id)
        REFERENCES Customers (customer_id)
        ON DELETE CASCADE
        ON UPDATE NO ACTION
);

Deleting a customer automatically deletes the orders for that customer; an insert with an unknown customer_id fails with error 547.

ON DELETE SET NULL to keep the child row
CREATE TABLE Employees (
    emp_id     INT PRIMARY KEY,
    name       NVARCHAR(100) NOT NULL,
    dept_id    INT NULL,
    CONSTRAINT FK_Employees_Departments
        FOREIGN KEY (dept_id)
        REFERENCES Departments (dept_id)
        ON DELETE SET NULL
);

Deleting a department leaves its employees in place but sets their dept_id to NULL, so the column must be nullable.

Follow-up questions to expect
  • What is the difference between ON DELETE CASCADE and ON DELETE SET NULL?
  • Why might CREATE TABLE fail with error 1785, and how do you resolve multiple cascade paths?
  • What does it mean for a foreign key to be not trusted, and how does that affect the optimizer?
  • When would you enforce a relationship in application code instead of a constraint?

Related: What is a foreign key and what does it enforce? · How do you model a one to many relationship? · What is the difference between a soft delete and a hard delete?

14 When should a schema be normalized versus denormalized?

The decision comes down to whether the workload is write-heavy or read-heavy. Normalization is the process of splitting data into focused tables so that each fact is stored exactly once, with relationships expressed through foreign keys. Denormalization deliberately reverses some of that, folding related data back together or duplicating a column so that common reads touch fewer tables.

For OLTP (online transaction processing) systems - orders, payments, user accounts, anything that is updated constantly - the default is to normalize, usually to third normal form (3NF). When every fact lives in one place, an update, insert or delete changes a single row and there is no chance of two copies disagreeing. This is what protects you from update, insertion and deletion anomalies and keeps write integrity high. Storing a customer address once means correcting it once, not hunting down every order that copied it.

  • Normalize (default, aim for 3NF) when the system writes often, correctness matters, and redundancy would create anomalies. This covers most transactional applications.
  • Denormalize selectively when the system is read-mostly - reporting dashboards, analytics, OLAP warehouses - and the cost of many joins per query outweighs the cost of keeping duplicated data in sync.

Denormalization is a controlled trade-off, not a shortcut. When you copy a value or precompute an aggregate, you take on the responsibility of keeping every copy consistent, typically through triggers, application logic, or a scheduled batch reload. That is acceptable in a data warehouse that is loaded in controlled batches and rarely updated in place, which is exactly why analytical models such as a star schema keep dimensions denormalized for speed.

The professional discipline is to measure before you denormalize. Start normalized, look at the actual execution plans and slow queries, and only introduce redundancy where profiling proves that joins are the real bottleneck and indexing cannot fix it. Denormalizing on a hunch trades away integrity for a performance gain you never confirmed. For the underlying normal forms and anomalies, see the normalization questions.

Orders (unnormalized)order_idcustomer_nameproductpriceCustomerscustomer_id PKnameProductsproduct_id PKpriceOrdersorder_id PKcustomer_id FK3NF: each fact stored once, no transitive dependencies
Normalized (3NF): join to resolve the category name at read time
SELECT o.order_id,
       p.product_name,
       c.category_name,
       o.quantity
FROM Orders AS o
INNER JOIN Products AS p
    ON p.product_id = o.product_id
INNER JOIN Categories AS c
    ON c.category_id = p.category_id;

category_name is stored once in Categories; every read joins through Products to fetch it, so a rename updates a single row.

Denormalized: category_name copied onto Products to skip a join
-- category_name is duplicated onto Products for fast reads
SELECT o.order_id,
       p.product_name,
       p.category_name,
       o.quantity
FROM Orders AS o
INNER JOIN Products AS p
    ON p.product_id = o.product_id;

-- Cost of the redundancy: you must keep the copy in sync
UPDATE Products
SET    category_name = 'Home & Kitchen'
WHERE  category_id = 7;

One fewer join per read, but renaming a category now means updating every affected Products row, not one Categories row.

Follow-up questions to expect
  • What update, insertion and deletion anomalies does normalization prevent?
  • How do you keep denormalized copies of data in sync with the source of truth?
  • Would you denormalize an OLTP table just because a report is slow, or fix it another way first?
  • What does third normal form (3NF) require, and when is it worth going further to BCNF?

Related: What is normalization and why do we do it? · When is it acceptable to denormalize? · What is the difference between a star schema and a snowflake schema?

15 How do you model audit history or change tracking?

Audit history means keeping a durable record of how a row changed over time: the previous values, the new values, the operation, and the who and when. The current table alone cannot answer that, because an UPDATE or DELETE overwrites the past. The question is really about where you capture the change and how much of that plumbing you write yourself.

There are three common designs, from most manual to most automatic:

  • History table with triggers. Create a companion table (for example Orders_Audit) and attach AFTER INSERT, UPDATE and DELETE triggers that copy the affected rows into it. Inside a trigger the special inserted and deleted tables give you the new and old versions, so you can store both alongside the operation, SUSER_SNAME(), and SYSUTCDATETIME(). Maximum control, but you own the code and it can slow down heavy writes.
  • System-versioned temporal tables. SQL Server (2016+) tracks history for you. You add two hidden datetime2 period columns and a linked history table, and every change is versioned automatically. You then query the past with FOR SYSTEM_TIME. Little code, but you get a full row copy per version, so storage grows.
  • Change Data Capture (CDC). An asynchronous log reader records inserts, updates and deletes into system change tables, mainly to feed ETL and downstream systems rather than to serve as a human-readable audit log.

The trade-offs are storage, performance, and how much you must maintain. Triggers are flexible and let you record exactly the columns and metadata you care about, but they run inside the writing transaction and add overhead. Temporal tables are clean and reliable but keep a full copy of every version. CDC is decoupled and low-impact on writers but is heavier to set up and is aimed at data movement.

Audit history is closely related to soft deletes: both are ways to avoid losing information, and a well designed audit table also keeps the correct keys and constraints so you can trace a change back to the exact row it came from. See the Microsoft docs on temporal tables for the automatic option.

INSERTon OrdersTriggerfireswrite rowto AuditA data change automatically runs trigger logic
AFTER UPDATE trigger that writes old and new values to an audit table
-- Audit table captures the change plus who and when
CREATE TABLE Orders_Audit (
    audit_id    BIGINT IDENTITY PRIMARY KEY,
    order_id    INT           NOT NULL,
    old_status  VARCHAR(20)   NULL,
    new_status  VARCHAR(20)   NULL,
    changed_by  SYSNAME       NOT NULL,
    changed_at  DATETIME2     NOT NULL
);
GO

CREATE TRIGGER trg_Orders_Audit
ON Orders
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO Orders_Audit
        (order_id, old_status, new_status, changed_by, changed_at)
    SELECT d.order_id, d.status, i.status,
           SUSER_SNAME(), SYSUTCDATETIME()
    FROM inserted AS i
    JOIN deleted  AS d ON d.order_id = i.order_id
    WHERE i.status <> d.status;   -- only log real changes
END;

The deleted table holds the pre-update row and inserted holds the new one, so you can store both sides of the change.

System-versioned temporal table (history kept automatically)
CREATE TABLE Orders (
    order_id   INT PRIMARY KEY,
    status     VARCHAR(20)  NOT NULL,
    total      DECIMAL(10,2) NOT NULL,
    valid_from DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    valid_to   DATETIME2 GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (valid_from, valid_to)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.Orders_History));

-- Later: see what every order looked like at a point in time
SELECT order_id, status, total
FROM Orders
FOR SYSTEM_TIME AS OF '2026-01-01T00:00:00';

SQL Server writes prior versions to Orders_History on every change; FOR SYSTEM_TIME queries the past with no extra triggers.

Follow-up questions to expect
  • How do you capture the old row values inside an AFTER UPDATE trigger?
  • What are the storage and performance costs of temporal tables versus trigger-based auditing?
  • How would you record which application user made a change, not just the database login?
  • When would you choose Change Data Capture over a hand-written audit table?

Related: What is a trigger and when should you use one? · What is the difference between a soft delete and a hard delete? · How do you enforce referential integrity?

Keep going

Explore the other interview categories, or drill the fundamentals with the functions library and the SQL Server error library.