Home › SQL Interview Questions › Design

SQL Interview Prep

Normalization Interview Questions

Q: What is normalization and why do we do it?

Normalization is the process of organizing tables so each fact is stored exactly once, reducing redundancy and dependency. We do it to prevent update, insert and delete anomalies and keep data consistent, applying normal forms like 1NF, 2NF and 3NF.

Q: What is first normal form (1NF)?

A table is in first normal form when every column holds a single atomic value (no repeating groups, comma-separated lists, or arrays) and every row is unique because a key exists. You fix a multi-valued column by moving it into a related table, one value per row.

Q: What is second normal form (2NF)?

A table is in second normal form when it is already in 1NF and has no partial dependency, meaning no non-key column depends on only part of a composite primary key. It only matters when the key is composite; you fix a violation by moving the partially dependent columns into their own table.

Q: What is third normal form (3NF)?

A table is in third normal form when it is already in 2NF and has no transitive dependency, meaning no non-key column depends on another non-key column. Every non-key column must depend on the key, the whole key, and nothing but the key.

Q: What is Boyce Codd normal form (BCNF)?

BCNF is a stricter version of 3NF that requires the left side of every functional dependency X -> Y to be a superkey. It removes the rare anomalies 3NF still allows when a table has overlapping candidate keys and a determinant that is not itself a candidate key.

Q: What is a functional dependency?

A functional dependency X -> Y means every value of X determines exactly one value of Y, so X functionally determines Y. Functional dependencies are the foundation of normalization: they reveal which columns depend on the key and which depend on the wrong thing.

Q: What is a partial dependency?

A partial dependency is when a non-key column depends on only part of a composite primary key instead of the whole key. It can only happen when the key is composite, and removing it is exactly what Second Normal Form (2NF) requires.

Q: What are update, insertion and deletion anomalies?

Three data problems caused by redundancy in an unnormalized table: an update anomaly lets the same repeated fact go inconsistent, an insertion anomaly blocks adding a fact without unrelated data, and a deletion anomaly loses a separate fact when a row is removed. Normalization removes all three.

Q: What is the difference between normalization and denormalization?

Normalization splits data into focused tables to remove redundancy and update anomalies, which favors safe writes and integrity. Denormalization deliberately reintroduces redundancy, such as duplicated or precomputed columns, to cut joins and speed up read-heavy workloads at the cost of keeping copies in sync.

Normalization questions test whether you understand why we split tables. They cover the normal forms, functional dependencies, update anomalies and the trade-offs of denormalization. Each answer shows the before-and-after tables in a diagram with SQL.

Design 15 questions

1 What is normalization and why do we do it?

Normalization is the process of structuring a relational schema so that data is broken into focused tables and each fact is stored in exactly one place. The goal is to reduce redundancy (the same value repeated in many rows) and remove undesirable dependencies, so that every non-key column depends on the key, the whole key, and nothing but the key.

We normalize primarily to protect data integrity. When a fact is duplicated across rows, the database can drift into contradictory states. These problems are known as anomalies:

Update anomaly: a value like a customer address is stored on every order row, so changing it means updating many rows and risking that some are missed.
Insert anomaly: you cannot record a new fact (a product with no orders yet) because unrelated columns would be left empty or invented.
Delete anomaly: removing the last order for a customer also erases the only copy of that customer.

Normalization is achieved by progressively applying the normal forms. 1NF requires atomic values and no repeating groups. 2NF removes partial dependencies on part of a composite key. 3NF removes transitive dependencies, where a non-key column depends on another non-key column. BCNF is a stricter version of 3NF for edge cases with overlapping candidate keys. See the normalization guide and this worked walkthrough for the full progression.

The trade-off is read performance. A highly normalized schema spreads data across many tables, so answering a query can require several JOIN operations. For read-heavy analytics or reporting workloads teams sometimes deliberately reverse course with denormalization, accepting controlled redundancy in exchange for fewer joins and faster reads.

Redundant table: customer details repeated on every order row

CREATE TABLE Orders (
    order_id       INT PRIMARY KEY,
    order_date     DATE,
    customer_name  VARCHAR(100),
    customer_email VARCHAR(100),
    customer_city  VARCHAR(60)
);

-- The same customer_name, customer_email and customer_city
-- are duplicated for every order that customer places.
-- Change one email and you must update every matching row.

Customer facts are duplicated across order rows, which invites update, insert and delete anomalies.

Normalized split: store each customer fact once, reference by key

CREATE TABLE Customers (
    customer_id    INT PRIMARY KEY,
    customer_name  VARCHAR(100),
    customer_email VARCHAR(100),
    customer_city  VARCHAR(60)
);

CREATE TABLE Orders (
    order_id    INT PRIMARY KEY,
    order_date  DATE,
    customer_id INT NOT NULL
        REFERENCES Customers(customer_id)
);

Each customer is stored once in Customers; Orders points at it by customer_id, so an email change is a single-row update.

Follow-up questions to expect

Can you walk through taking a table from 1NF to 2NF to 3NF?
When would you deliberately denormalize a schema?
What is the difference between 3NF and BCNF?
How do foreign keys help enforce the integrity that normalization sets up?

2 What is first normal form (1NF)?

First normal form (1NF) is the first step of database normalization. A table satisfies 1NF when two conditions hold: every column stores a single atomic value, and every row is uniquely identifiable because the table has a key.

Atomic means one value per column per row. A column that packs several values into one cell breaks 1NF, whether the values are a comma-separated list, a repeating group of columns like phone1, phone2, phone3, or an array. The moment you would need to split a cell to answer a question ("which customers have a mobile number?"), the design is not in 1NF.

Single value per cell: no lists, no arrays, no delimited strings.
No repeating groups: do not model "many of a thing" by adding numbered columns.
A key exists: a primary key (or equivalent) makes each row unique so a value can always be addressed by key plus column name.

The fix for a multi-valued column is to split it into separate rows in a related child table. Each value from the offending cell becomes its own row, linked back to the parent by a foreign key. This turns one crowded column into a clean one-to-many relationship that you can index, filter, and join.

1NF makes data queryable with plain SQL: you can count, filter, and join individual values instead of parsing strings. It is the foundation the higher forms (2NF and beyond) build on.

Not in 1NF: a comma-separated phones column

CREATE TABLE Customers (
    customer_id INT PRIMARY KEY,
    name        NVARCHAR(100),
    phones      NVARCHAR(200)  -- e.g. '555-0100, 555-0101, 555-0102'
);

INSERT INTO Customers (customer_id, name, phones)
VALUES (1, 'Ada Lovelace', '555-0100, 555-0101, 555-0102');

The phones column packs several numbers into one cell, so it violates 1NF. You cannot easily count or search a single number without parsing the string.

In 1NF: one phone per row in a related child table

CREATE TABLE Customers (
    customer_id INT PRIMARY KEY,
    name        NVARCHAR(100)
);

CREATE TABLE Phones (
    phone_id    INT PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES Customers(customer_id),
    phone       NVARCHAR(20) NOT NULL
);

INSERT INTO Phones (phone_id, customer_id, phone)
VALUES (1, 1, '555-0100'),
       (2, 1, '555-0101'),
       (3, 1, '555-0102');

Each phone number is now atomic and lives in its own row, linked to the customer by a foreign key. This one-to-many design is in 1NF and is easy to query and index.

Follow-up questions to expect

How would you write a query to list every phone number for a given customer after the split?
What problems does storing a comma-separated list cause for indexing and constraints?
Once a table is in 1NF, what does second normal form (2NF) add?
How do you decide whether a value is truly atomic for your application?

3 What is second normal form (2NF)?

Second normal form (2NF) is the second step in database normalization. A table is in 2NF when two things are true: it is already in first normal form (1NF), and it has no partial dependency. A partial dependency means a non-key column depends on only part of a composite primary key rather than the whole key.

The key insight is that 2NF is only relevant when the primary key is composite (made of two or more columns). If a table has a single-column primary key, there is no "part" of the key to depend on, so the table is automatically in 2NF once it reaches 1NF.

Consider an OrderItems table keyed on the pair (order_id, product_id). A column like quantity correctly depends on the whole key: the quantity is specific to one product on one order. But a column like product_name depends on product_id alone, not on order_id. That is a partial dependency, and it violates 2NF.

Redundancy: the product name is repeated on every order line for that product.
Update anomaly: renaming a product means updating many rows, and missing one leaves the data inconsistent.
Insert anomaly: you cannot record a product that has never been ordered, because there is no order_id to complete the key.

You fix the violation by moving the partially dependent columns into a new table keyed by the part they depend on. Here product_name moves to a Products table keyed on product_id, and OrderItems keeps only quantity, which depends on the full key. Removing partial dependencies is exactly what separates 2NF from a raw 1NF table.

A 2NF violation: product_name depends on only part of the key

-- Composite primary key is (order_id, product_id)
CREATE TABLE OrderItems (
    order_id     INT     NOT NULL,
    product_id   INT     NOT NULL,
    product_name VARCHAR(100) NOT NULL,  -- depends on product_id only
    quantity     INT     NOT NULL,       -- depends on the whole key
    CONSTRAINT PK_OrderItems PRIMARY KEY (order_id, product_id)
);

product_name is a partial dependency: it is fixed by product_id and gets repeated on every order line for that product.

Normalized to 2NF: split the partial dependency into Products

-- product_name now lives where it belongs, keyed by product_id alone
CREATE TABLE Products (
    product_id   INT          NOT NULL,
    product_name VARCHAR(100) NOT NULL,
    CONSTRAINT PK_Products PRIMARY KEY (product_id)
);

CREATE TABLE OrderItems (
    order_id   INT NOT NULL,
    product_id INT NOT NULL,
    quantity   INT NOT NULL,  -- depends on the full (order_id, product_id) key
    CONSTRAINT PK_OrderItems PRIMARY KEY (order_id, product_id),
    CONSTRAINT FK_OrderItems_Products
        FOREIGN KEY (product_id) REFERENCES Products (product_id)
);

Each product name is stored once. OrderItems keeps only columns that depend on the entire composite key.

Follow-up questions to expect

Why is 2NF only a concern for tables with a composite primary key?
What is the difference between a partial dependency and a transitive dependency?
How does third normal form (3NF) build on 2NF?
Can you give an example of an update anomaly that 2NF removes?

4 What is third normal form (3NF)?

Third normal form (3NF) is the next step in database normalization after second normal form. A table is in 3NF when two things are true: it is already in 2NF, and it has no transitive dependency. A transitive dependency is when a non-key column depends on another non-key column instead of depending directly on the primary key.

The classic memory aid is that in 3NF every non-key column must depend on the key, the whole key, and nothing but the key. The "whole key" part is handled by 2NF (no partial dependency on part of a composite key). The "nothing but the key" part is what 3NF adds: a non-key column is not allowed to be determined by some other non-key column.

A simple example: an Employees table with columns emp_id (the key), dept_id, and dept_name. Here dept_name depends on dept_id, and dept_id is not the key, so emp_id -> dept_id -> dept_name is a transitive dependency. This causes update anomalies: renaming a department means updating many employee rows, and a department with no employees cannot be recorded at all.

The fix is to move the transitively dependent columns into their own table. Keep dept_id as a foreign key in Employees, and put dept_id and dept_name in a separate Departments table where dept_name now depends directly on that table primary key. A deeper walkthrough of normalization shows how each form builds on the last.

2NF required first: no non-key column may depend on only part of a composite key.
No transitive dependency: no non-key column may depend on another non-key column.
Benefit: removing transitive dependencies eliminates redundant data and the insert, update, and delete anomalies it causes.
Stricter cousin: when even a candidate key can determine part of another key, you may need BCNF.

Not in 3NF: dept_name transitively depends on dept_id

CREATE TABLE Employees (
    emp_id     INT          PRIMARY KEY,
    emp_name   NVARCHAR(100) NOT NULL,
    dept_id    INT          NOT NULL,
    dept_name  NVARCHAR(100) NOT NULL  -- depends on dept_id, not on emp_id
);

emp_id -> dept_id -> dept_name is a transitive dependency, so dept_name is duplicated across every employee in the same department.

In 3NF: split the transitive dependency into its own table

CREATE TABLE Departments (
    dept_id    INT          PRIMARY KEY,
    dept_name  NVARCHAR(100) NOT NULL  -- now depends directly on the key
);

CREATE TABLE Employees (
    emp_id     INT          PRIMARY KEY,
    emp_name   NVARCHAR(100) NOT NULL,
    dept_id    INT          NOT NULL
        REFERENCES Departments(dept_id)
);

dept_name lives in Departments and is stored once; Employees keeps dept_id as a foreign key. No non-key column depends on another non-key column.

Follow-up questions to expect

What is a transitive dependency and how do you spot one?
How does BCNF differ from 3NF?
Can a table be in 2NF but not in 3NF? Give an example.
When would you deliberately denormalize a 3NF design?

5 What is Boyce Codd normal form (BCNF)?

Boyce Codd normal form (BCNF) is a stronger form of database normalization than third normal form (3NF). The rule is short: for every non trivial functional dependency X -> Y, the determinant X must be a superkey of the table. In other words, every determinant has to be a key.

3NF already forbids a non key column from depending on another non key column, but it makes an exception: it still allows a dependency whose right side is part of a candidate key (a prime attribute). BCNF closes that loophole. The two forms only differ for the uncommon case of a table with multiple overlapping candidate keys where some determinant is not a candidate key.

The classic example is a table (student, subject, teacher) under two rules: each teacher teaches exactly one subject, and for a given subject a student is assigned one teacher. Here teacher -> subject holds, but teacher is not a candidate key, so the table is in 3NF yet violates BCNF. That lets update anomalies creep back in: you cannot record which subject a teacher covers until a student enrols, and changing a teacher subject means editing many rows.

Fix: decompose the table so that every determinant becomes a key of its own table.
Split (student, subject, teacher) into Teaches(teacher, subject) and Enrolment(student, teacher).
Now teacher is the key of Teaches, so teacher -> subject no longer breaks the rule.

One caveat interviewers like to hear: a BCNF decomposition is always lossless, but it is not always dependency preserving, so occasionally a design stays at 3NF on purpose to keep a functional dependency enforceable with a single key. See also 3NF vs BCNF.

3NF but not BCNF: teacher determines subject in one table

CREATE TABLE StudentSubjectTeacher (
    student  VARCHAR(50) NOT NULL,
    subject  VARCHAR(50) NOT NULL,
    teacher  VARCHAR(50) NOT NULL,
    -- candidate keys: (student, subject) and (student, teacher)
    CONSTRAINT PK_SST PRIMARY KEY (student, subject),
    CONSTRAINT UQ_SST UNIQUE (student, teacher)
);
-- teacher -> subject holds, but teacher is not a superkey: this violates BCNF.

The dependency teacher -> subject is legal in 3NF because subject is a prime attribute, yet it still causes update anomalies.

BCNF decomposition: every determinant is now a key

CREATE TABLE Teaches (
    teacher  VARCHAR(50) NOT NULL,
    subject  VARCHAR(50) NOT NULL,
    CONSTRAINT PK_Teaches PRIMARY KEY (teacher)   -- teacher -> subject
);

CREATE TABLE Enrolment (
    student  VARCHAR(50) NOT NULL,
    teacher  VARCHAR(50) NOT NULL,
    CONSTRAINT PK_Enrolment PRIMARY KEY (student, teacher),
    CONSTRAINT FK_Enrolment_Teacher FOREIGN KEY (teacher)
        REFERENCES Teaches(teacher)
);

Splitting on the offending determinant makes teacher the key of Teaches, so every functional dependency now has a superkey on its left side.

Follow-up questions to expect

How exactly does BCNF differ from 3NF, and when do they coincide?
What is a lossless join decomposition, and why does it matter here?
Can a BCNF decomposition ever lose a functional dependency? Give an example.
Would you always normalize to BCNF in a production schema? Why or why not?

6 What is a functional dependency?

A functional dependency, written X -> Y, states that each value of the attribute set X determines exactly one value of the attribute set Y. We say X functionally determines Y, or that Y is functionally dependent on X. If two rows agree on X they must also agree on Y. For example student_id -> student_name: given a student id there is only ever one name for it, while the reverse does not hold because two students can share a name.

The left side X is called the determinant. Determinants are exactly how keys are defined: a candidate key is a minimal set of columns that functionally determines every other column in the table. Dependencies where the determinant is not a candidate key are what normalization works to remove, because they let the same fact be stored (and updated inconsistently) in many rows.

Three shapes of dependency drive the normal forms:

Full functional dependency: Y depends on the whole of a composite key and not on any part of it. This is the target state that 2NF requires.
Partial dependency: Y depends on only part of a composite key, for example (order_id, product_id) -> product_name where product_name really depends on product_id alone. Removing partial dependencies gives second normal form.
Transitive dependency: a non-key column determines another non-key column, for example student_id -> dept_id and dept_id -> dept_name, so student_id -> dept_name holds indirectly. Removing transitive dependencies gives third normal form.

Functional dependencies are a property of the meaning of the data, not of the rows that happen to be present. You cannot prove a dependency by looking at a sample; you assert it from the business rules and then design tables so that every determinant is a key. Get the dependencies right and the normal forms follow almost mechanically.

A clean design where the determinant is the key

CREATE TABLE Students (
    student_id  INT         NOT NULL PRIMARY KEY,
    student_name NVARCHAR(100) NOT NULL,
    dept_id     INT         NOT NULL
);
-- student_id -> student_name and student_id -> dept_id.
-- The determinant (student_id) is the primary key, so each
-- fact about a student is stored exactly once.

Every non-key column depends fully on the key, which is what the higher normal forms require.

A violation: a partial dependency in a composite key

CREATE TABLE OrderLines (
    order_id     INT NOT NULL,
    product_id   INT NOT NULL,
    product_name NVARCHAR(100) NOT NULL,
    quantity     INT NOT NULL,
    PRIMARY KEY (order_id, product_id)
);
-- product_name depends only on product_id, not the whole key.
-- product_id -> product_name is a partial dependency, so the
-- product name is repeated on every order line for that product.

The fix is to split product_name into a Products table keyed by product_id, which reaches 2NF.

Follow-up questions to expect

How does a partial dependency differ from a transitive dependency?
What is the relationship between a determinant and a candidate key?
Which normal form removes transitive dependencies, and why?
Can you infer a functional dependency just by looking at the current rows?

7 What is a transitive dependency?

A transitive dependency exists when a non-key column depends on the primary key only through another non-key column, rather than directly on the key. In functional dependency terms, if A -> B and B -> C, then A -> C holds indirectly (transitively) even though C does not depend on A directly. Here A is the key and both B and C are non-key columns.

A classic example is an Orders table keyed on order_id. The order determines which employee handled it (order_id -> employee_id), and the employee determines their own name (employee_id -> employee_name). So order_id -> employee_name holds only through employee_id. The name is stored in the wrong table: it describes the employee, not the order.

The key is order_id.
employee_id is a non-key column that depends on the key.
employee_name is a non-key column that depends on employee_id, not on the key, so it is transitively dependent.

Transitive dependencies are exactly what third normal form (3NF) forbids. A table is in 3NF when it is already in second normal form and every non-key column depends on the key directly, not through another non-key column. Storing employee_name in Orders causes update anomalies: if the employee changes their name you must rewrite it on every one of their orders, and an order-less employee cannot be recorded at all. See database normalization for the full progression.

The fix is to split the middle dependency into its own table. Move employee_name into an Employees table keyed on employee_id, and keep only the employee_id foreign key on Orders. Now every non-key column in each table depends on that table's key and nothing else.

Before: a table with a transitive dependency

CREATE TABLE Orders (
    order_id      INT          PRIMARY KEY,
    order_date    DATE         NOT NULL,
    employee_id   INT          NOT NULL,
    employee_name VARCHAR(100) NOT NULL
);
-- order_id -> employee_id -> employee_name
-- employee_name depends on the key only through employee_id.

employee_name is transitively dependent on order_id, so the same name is repeated across every order that employee handled.

After: split into 3NF

CREATE TABLE Employees (
    employee_id   INT          PRIMARY KEY,
    employee_name VARCHAR(100) NOT NULL
);

CREATE TABLE Orders (
    order_id    INT  PRIMARY KEY,
    order_date  DATE NOT NULL,
    employee_id INT  NOT NULL
        REFERENCES Employees(employee_id)
);

The middle dependency now lives in Employees; Orders keeps only the employee_id foreign key, so every non-key column depends directly on its key.

Follow-up questions to expect

How does a transitive dependency differ from a partial dependency?
What update anomalies does a transitive dependency cause?
Give the formal definition of third normal form.
Is a surrogate key enough to put a table in 3NF on its own?

8 What is a partial dependency?

A partial dependency exists when a non-key column is functionally determined by only part of a composite primary key rather than the whole key. In other words, you could work out that column value from just one component of the key, so it does not truly belong in that table.

The important precondition: a partial dependency is only possible when the primary key is composite (made of two or more columns). If the key is a single column, every non-key attribute depends on the whole key by definition, so no partial dependency can exist. This is why single-column-key tables that are already in First Normal Form automatically satisfy 2NF.

Consider an OrderItems table with the composite key (order_id, product_id). A column like quantity depends on the full key, that is correct. But product_name depends only on product_id, one half of the key. That is a partial dependency: the product name is repeated on every order line for that product and can be updated inconsistently.

Symptom: a column that changes only when one part of the key changes, and is duplicated across many rows.
Why it is bad: it causes update, insertion, and deletion anomalies (the classic redundancy problems).
The fix: move the partially dependent column into its own table keyed by the attribute it really depends on. Here, product_name moves to a Products table keyed by product_id.

Eliminating partial dependencies is precisely what Second Normal Form (2NF) does. A table is in 2NF when it is in 1NF and every non-key column depends on the whole primary key, not just a part of it. Do not confuse this with a transitive dependency (non-key column depending on another non-key column), which is what Third Normal Form removes.

The problem: a partial dependency inside the composite key

CREATE TABLE OrderItems (
    order_id     INT        NOT NULL,
    product_id   INT        NOT NULL,
    quantity     INT        NOT NULL,
    product_name NVARCHAR(100) NOT NULL,  -- depends only on product_id
    CONSTRAINT PK_OrderItems PRIMARY KEY (order_id, product_id)
);

product_name depends on only part of the key (product_id), so it repeats on every order line for that product.

The fix: split so every column depends on the whole key (2NF)

CREATE TABLE Products (
    product_id   INT           NOT NULL PRIMARY KEY,
    product_name NVARCHAR(100) NOT NULL
);

CREATE TABLE OrderItems (
    order_id   INT NOT NULL,
    product_id INT NOT NULL,
    quantity   INT NOT NULL,
    CONSTRAINT PK_OrderItems PRIMARY KEY (order_id, product_id),
    CONSTRAINT FK_OrderItems_Products
        FOREIGN KEY (product_id) REFERENCES Products(product_id)
);

product_name now lives once in Products; OrderItems keeps only columns that depend on the full (order_id, product_id) key.

Follow-up questions to expect

How is a partial dependency different from a transitive dependency?
Can a table with a single-column primary key ever have a partial dependency?
Which normal form removes partial dependencies, and what does it require?
What anomalies does an unresolved partial dependency cause?

9 What are update, insertion and deletion anomalies?

When a single wide table stores several unrelated facts together, the same fact ends up repeated across many rows. That redundancy is what causes the three classic modification anomalies. They are the standard motivation an interviewer expects you to give for normalization: splitting one overloaded table into focused tables so each fact is stored exactly once.

Picture an Orders table that repeats the customer name, the customer city and the product price on every single order line. The data model works, but it is fragile in three specific ways.

Update anomaly: a fact stored in many rows must be changed in every row at once. If a customer moves city and you update only some of their order rows, the table now disagrees with itself and the data is inconsistent.
Insertion anomaly: you cannot record one fact without also supplying unrelated data. You cannot add a brand new product to the catalog until someone actually orders it, because a product only exists as part of an order row.
Deletion anomaly: deleting a row throws away a second, independent fact by accident. Deleting the last order for a customer also erases that customer entirely, so you lose their address even though you never meant to.

The cure is the same in every case: normalize the design so each fact lives in one place. Customers go in a Customers table, products in a Products table, and the Orders table keeps only foreign keys that point at them. A customer city is then stored once, a product can exist with zero orders, and deleting an order cannot remove a customer.

In an interview it is worth naming which normal form each fix relates to. Repeating groups are removed by first and second normal form, while facts that depend on a non-key column are removed by third normal form. All three anomalies disappear once every non-key column depends on the key, the whole key and nothing but the key.

The problem: one wide table with repeated facts

CREATE TABLE Orders (
    order_id      INT,
    customer_name VARCHAR(100),
    customer_city VARCHAR(100),
    product_name  VARCHAR(100),
    unit_price    DECIMAL(10,2)
);

-- Customer city is repeated on every order row.
-- Moving a customer means updating many rows (update anomaly).
UPDATE Orders
SET customer_city = 'Seattle'
WHERE customer_name = 'Ada Lovelace';

-- Miss a row and the table contradicts itself:
-- some rows say Seattle, some still say the old city.

A new product cannot be added until it is ordered (insertion anomaly), and deleting a customer last order loses their city (deletion anomaly).

The fix: normalized tables, each fact stored once

CREATE TABLE Customers (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(100),
    city        VARCHAR(100)
);

CREATE TABLE Products (
    product_id INT PRIMARY KEY,
    name       VARCHAR(100),
    unit_price DECIMAL(10,2)
);

CREATE TABLE Orders (
    order_id    INT PRIMARY KEY,
    customer_id INT REFERENCES Customers(customer_id),
    product_id  INT REFERENCES Products(product_id)
);

-- City lives in one row now: one update, no inconsistency.
UPDATE Customers SET city = 'Seattle' WHERE customer_id = 1;

A product can exist with no orders, a customer survives deletion of their orders, and each fact changes in exactly one place.

Follow-up questions to expect

Which normal form removes each of these anomalies?
Can denormalization ever be the right choice despite these anomalies?
How do foreign keys help prevent inconsistent data after normalizing?
What is a transitive dependency and how does it relate to the update anomaly?

10 What is the difference between normalization and denormalization?

Normalization is the process of organizing columns and tables so that each fact is stored in exactly one place. You split a wide table into smaller focused tables linked by keys, progressing through the normal forms (1NF, 2NF, 3NF and beyond). The goal is to remove redundancy and the update, insert and delete anomalies that redundancy causes. See the normalization guide for a full walkthrough.

Denormalization is the deliberate reverse step: you add controlled redundancy back into a normalized schema. That can mean duplicating a column across tables, storing a precomputed or aggregated value, or merging tables that are almost always queried together. The point is to avoid expensive joins and aggregations at read time.

Normalization optimizes writes and integrity. One fact lives in one row, so an update touches a single place and the database cannot drift into contradictory states.
Denormalization optimizes reads. The data a query needs is already sitting together, so fewer joins and no repeated aggregation are needed to serve it.
Denormalization shifts the burden to write time. Every duplicated or precomputed value must be kept in sync, usually through application logic, triggers or a scheduled refresh, which adds complexity and risk of stale data.

The right choice depends on the workload. OLTP systems, where many concurrent transactions insert and update small amounts of data, lean toward a normalized design so writes stay cheap and consistent. OLAP and reporting systems, where large read-mostly queries scan and aggregate history, often denormalize (for example into star schemas or summary tables) so dashboards stay fast.

In interviews the key point is that these are not opposites you pick once. You normalize first to get a correct model, then denormalize selectively and only where a measured read bottleneck justifies the cost of maintaining the duplicated data.

Normalized: join to fetch each order line with its product name

SELECT ol.order_id, p.name AS product, ol.quantity, ol.unit_price
FROM OrderLines AS ol
INNER JOIN Products AS p
    ON p.product_id = ol.product_id;

The product name lives once in Products; the order line stores only product_id and joins to read the name.

Denormalized: a precomputed line_total avoids recomputing on every read

ALTER TABLE OrderLines
    ADD line_total AS (quantity * unit_price) PERSISTED;

SELECT order_id, SUM(line_total) AS order_total
FROM OrderLines
GROUP BY order_id;

A PERSISTED computed column stores quantity * unit_price on disk, so read queries skip the arithmetic; SQL Server keeps it in sync automatically.

Follow-up questions to expect

When would you choose to denormalize a schema, and how do you keep the copies in sync?
What is the difference between an OLTP and an OLAP workload in terms of schema design?
What are update, insert and delete anomalies, and how does normalization remove them?
How does a star schema in a data warehouse relate to denormalization?

11 When is it acceptable to denormalize?

Denormalization is the deliberate introduction of redundancy into a schema to make reads faster. It is acceptable, but only as an optimization you reach for after measuring, never as a default. Start from a properly normalized design, prove with query plans and timings that specific joins or aggregations are the bottleneck, and then denormalize the narrow spot that hurts.

The classic trigger is a read-heavy or reporting workload: the same expensive multi-table join or GROUP BY runs thousands of times, while writes are comparatively rare. Trading a little redundancy for far fewer joins at read time is a good deal when reads vastly outnumber writes. On a write-heavy OLTP path the trade is usually the wrong way round, because every write now has more copies to update.

Common techniques, from lightest to heaviest:

Precomputed aggregates / summary tables - store the daily or per-customer totals once instead of recomputing them on every query.
Duplicated lookup columns - copy a stable attribute (for example a product name or category) onto a child table to avoid a join for display.
Materialized / indexed views - let the engine maintain a persisted, indexed result set for you (an indexed view in SQL Server, a materialized view elsewhere).

Every one of these has the same price: you now own the consistency problem. The copy can drift from the truth unless you keep it in sync with triggers, scheduled refresh jobs, or application logic, and each of those adds complexity and a place for bugs. An indexed view is the safest option because the engine keeps it correct automatically, at the cost of slower writes to the base tables.

The rule that keeps this sane: keep the normalized tables as the single source of truth and treat every denormalized structure as a disposable, rebuildable cache. If a summary table is ever wrong you must be able to drop it and regenerate it from the base data. For where these ideas sit in the wider picture see performance tuning.

Summary table kept fresh by a trigger

-- Normalized source of truth stays untouched: Orders.
-- Denormalized cache: one running total per customer.
CREATE TABLE CustomerOrderTotals (
    customer_id INT PRIMARY KEY,
    order_count INT      NOT NULL DEFAULT 0,
    total_spent DECIMAL(12,2) NOT NULL DEFAULT 0
);
GO

CREATE TRIGGER trg_Orders_Rollup
ON Orders
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    MERGE CustomerOrderTotals AS t
    USING (
        SELECT customer_id, COUNT(*) AS c, SUM(total) AS s
        FROM inserted
        GROUP BY customer_id
    ) AS i
        ON t.customer_id = i.customer_id
    WHEN MATCHED THEN
        UPDATE SET order_count = t.order_count + i.c,
                   total_spent = t.total_spent + i.s
    WHEN NOT MATCHED THEN
        INSERT (customer_id, order_count, total_spent)
        VALUES (i.customer_id, i.c, i.s);
END;

Reports read one indexed row per customer instead of scanning and aggregating Orders every time; the trigger pays the consistency cost on write.

Indexed view: the engine maintains the aggregate for you

CREATE VIEW dbo.SalesByCustomer
WITH SCHEMABINDING
AS
SELECT
    customer_id,
    COUNT_BIG(*)  AS order_count,
    SUM(total)    AS total_spent
FROM dbo.Orders
GROUP BY customer_id;
GO

-- Materializing it: the unique clustered index persists the result.
CREATE UNIQUE CLUSTERED INDEX IX_SalesByCustomer
    ON dbo.SalesByCustomer (customer_id);

SQL Server keeps the indexed view in sync automatically as base rows change, so there is no drift; the trade off is extra work on every write to Orders.

Follow-up questions to expect

How do you keep a denormalized summary table consistent with its source?
What are the write-side costs of an indexed view in SQL Server?
How would you decide between a trigger, a scheduled job, and application logic to maintain a copy?
When would denormalizing actually make performance worse?

12 What is the difference between 3NF and BCNF?

Both Third Normal Form (3NF) and Boyce-Codd Normal Form (BCNF) build on the same goal: eliminate the update, insert, and delete anomalies that come from redundant data driven by functional dependencies. They differ only in how strict the rule about determinants is. See the wider topic on database normalization for the full progression from 1NF onward.

A table is in 3NF when it is already in 2NF and every non-key attribute depends on the key, the whole key, and nothing but the key. More precisely, for every functional dependency X -> Y, either X is a superkey, or Y is a prime attribute (that is, Y is part of some candidate key). That second escape clause is the loophole: 3NF permits a non-key column to determine part of a candidate key in rare cases where candidate keys overlap.

A table is in BCNF when, for every non-trivial functional dependency X -> Y, X is a superkey. There is no exception for prime attributes. In other words, every determinant (the left side of a dependency) must be a candidate key or a superset of one. This is why BCNF is sometimes called 3.5NF: it closes the exact gap that 3NF leaves open.

In the vast majority of schemas the two forms coincide: most 3NF tables are already in BCNF, so you rarely have to do extra work.
The forms diverge only when a table has multiple overlapping candidate keys that are composite, and a non-key attribute (or a prime attribute) determines part of another candidate key.
BCNF is stricter, so every BCNF table is automatically in 3NF, but not the reverse.

One important trade-off: 3NF can always be reached while keeping the decomposition both lossless and dependency preserving (every functional dependency can still be checked on a single table). BCNF sometimes cannot. Decomposing a stubborn table into BCNF can occasionally force you to sacrifice dependency preservation, meaning a constraint can only be enforced by joining tables back together. That is the practical reason some designers stop at 3NF.

The classic textbook example is a table like ClassSchedule(Student, Subject, Teacher) where each teacher teaches exactly one subject, and each student studies a subject with one teacher. Candidate keys are {Student, Subject} and {Student, Teacher}. The dependency Teacher -> Subject holds, but Teacher is not a superkey, so the table violates BCNF. It is still in 3NF because Subject is a prime attribute. See the dedicated pages on 3NF and BCNF for deeper walkthroughs.

A table that is 3NF but not BCNF

-- Rules: each Teacher teaches exactly one Subject.
-- Candidate keys: (Student, Subject) and (Student, Teacher).
-- The dependency Teacher -> Subject breaks BCNF because
-- Teacher is not a superkey, yet Subject is a prime attribute
-- so 3NF is still satisfied.
CREATE TABLE ClassSchedule (
    Student  INT       NOT NULL,
    Subject  VARCHAR(40) NOT NULL,
    Teacher  VARCHAR(40) NOT NULL,
    CONSTRAINT PK_ClassSchedule PRIMARY KEY (Student, Subject)
);

Storing Teacher -> Subject here means the same fact is repeated on every row for that teacher, the redundancy BCNF removes.

Decomposing into BCNF

-- Split so every determinant is a superkey.
CREATE TABLE TeacherSubject (
    Teacher  VARCHAR(40) NOT NULL,
    Subject  VARCHAR(40) NOT NULL,
    CONSTRAINT PK_TeacherSubject PRIMARY KEY (Teacher)
);

CREATE TABLE StudentTeacher (
    Student  INT       NOT NULL,
    Teacher  VARCHAR(40) NOT NULL,
    CONSTRAINT PK_StudentTeacher PRIMARY KEY (Student, Teacher),
    CONSTRAINT FK_StudentTeacher_Teacher
        FOREIGN KEY (Teacher) REFERENCES TeacherSubject (Teacher)
);

Now Teacher is the key of its own table, so Teacher -> Subject no longer causes redundancy. Note the original (Student, Subject) uniqueness can no longer be enforced on a single table, the dependency-preservation cost of BCNF.

Follow-up questions to expect

Can you give an example where reaching BCNF loses dependency preservation?
Why is BCNF sometimes informally called 3.5NF?
What is a prime attribute, and why does 3NF make an exception for it?
When would you deliberately stop at 3NF instead of going to BCNF?

13 What is the difference between a candidate key and a primary key?

A candidate key is any minimal set of one or more columns whose values uniquely identify every row in a table. Minimal means you cannot remove any column from it and still keep uniqueness, so no column in a candidate key is redundant. A single table can legitimately have several candidate keys.

The primary key is not a different kind of key. It is simply the one candidate key you choose to be the main identifier for the table. That choice carries rules: a primary key column can never be NULL, its values must stay unique, and it is the natural target for the foreign keys that reference this table.

Once you promote one candidate key to primary key, the leftover candidate keys do not lose their meaning. They are called alternate keys, and you keep them enforced by adding a UNIQUE constraint on each. That way the database still guarantees uniqueness for those columns even though they are not the chosen identifier.

It also helps to separate a candidate key from a superkey. A superkey is any set of columns that uniquely identifies a row, even if it carries extra columns you do not need. A candidate key is a superkey with the redundancy stripped out - it is minimal. Every candidate key is a superkey, but not every superkey is a candidate key.

Superkey: uniquely identifies a row, may include unnecessary columns.
Candidate key: a minimal superkey, no column can be dropped.
Primary key: the one candidate key chosen as the main identifier, and NOT NULL.
Alternate key: a candidate key not chosen as primary, enforced with UNIQUE.

Picking the primary key is a design decision. Interviewers often probe whether you would choose a natural candidate key or add a surrogate one, so it is worth being able to weigh surrogate versus natural keys and explain the tradeoff.

Two candidate keys: choose one as PRIMARY KEY, enforce the other with UNIQUE

CREATE TABLE Employees (
    emp_id      INT          NOT NULL,
    national_id CHAR(11)     NOT NULL,
    email       VARCHAR(255) NOT NULL,
    full_name   VARCHAR(120) NOT NULL,
    CONSTRAINT PK_Employees PRIMARY KEY (emp_id),
    CONSTRAINT AK_Employees_NationalId UNIQUE (national_id),
    CONSTRAINT AK_Employees_Email UNIQUE (email)
);

emp_id, national_id and email are all candidate keys. emp_id is chosen as the primary key; the other two stay unique as alternate keys.

A composite candidate key promoted to primary key

CREATE TABLE Enrollment (
    student_id  INT NOT NULL,
    course_id   INT NOT NULL,
    term_code   CHAR(6) NOT NULL,
    grade       CHAR(2) NULL,
    CONSTRAINT PK_Enrollment PRIMARY KEY (student_id, course_id, term_code)
);

The minimal set (student_id, course_id, term_code) is a candidate key; dropping any one column would let duplicate enrollments slip in.

Follow-up questions to expect

Why can a table have only one primary key but many unique constraints?
What is the difference between a candidate key and a superkey?
Would you choose a natural candidate key or a surrogate key as the primary key, and why?
How does a NULL value affect a UNIQUE alternate key in SQL Server?

14 What is fourth normal form (4NF)?

Fourth normal form (4NF) is the next step of database normalization above Boyce Codd normal form (BCNF). A table is in 4NF when two conditions hold: it is already in BCNF, and it has no non trivial multi valued dependency (MVD). Where BCNF is about functional dependencies, 4NF is about multi valued ones.

A multi valued dependency happens when one key independently determines two separate multi valued facts. Say a person can have many skills and can speak many languages, and the two lists have nothing to do with each other. If you cram both into one table keyed on the person, the table is forced to store every combination of skill and language just to represent them, because there is no way to say which skill goes with which language: they are independent.

The result is a Cartesian product explosion: a person with 3 skills and 4 languages needs 3 x 4 = 12 rows, all repeating the person.
That redundancy brings back update anomalies. Add a new language and you must insert one row for every existing skill, or the data becomes inconsistent.
The two facts are logically independent, so keeping them in one table is the mistake 4NF is designed to catch.

The fix is to split the offending table into two, one per independent multi valued fact, each keyed on the person. PersonSkill(person, skill) holds the skills, PersonLanguage(person, language) holds the languages, and the Cartesian blow up disappears. Each table now stores a fact once, and you can add a skill or a language without touching the other list.

In practice a non trivial MVD almost always shows up as a many to many relationship that has been squeezed into the wrong table. If a table pairs two unrelated multi valued attributes on the same key, it violates 4NF and should be decomposed.

Violates 4NF: two independent lists forced into one table

CREATE TABLE PersonSkillLanguage (
    person    VARCHAR(50) NOT NULL,
    skill     VARCHAR(50) NOT NULL,
    language  VARCHAR(50) NOT NULL,
    CONSTRAINT PK_PSL PRIMARY KEY (person, skill, language)
);

-- Alice has skills {SQL, Python} and languages {English, Spanish}.
-- Because skill and language are independent, every pair must be stored:
INSERT INTO PersonSkillLanguage (person, skill, language) VALUES
    ('Alice', 'SQL',    'English'),
    ('Alice', 'SQL',    'Spanish'),
    ('Alice', 'Python', 'English'),
    ('Alice', 'Python', 'Spanish');
-- 2 skills x 2 languages = 4 rows. Add one language and you insert 2 more rows.

The person -> skill and person -> language multi valued dependencies are independent, so the table stores their Cartesian product and repeats everything.

4NF fix: one table per independent multi valued fact

CREATE TABLE PersonSkill (
    person  VARCHAR(50) NOT NULL,
    skill   VARCHAR(50) NOT NULL,
    CONSTRAINT PK_PersonSkill PRIMARY KEY (person, skill)
);

CREATE TABLE PersonLanguage (
    person    VARCHAR(50) NOT NULL,
    language  VARCHAR(50) NOT NULL,
    CONSTRAINT PK_PersonLanguage PRIMARY KEY (person, language)
);

-- Each fact is stored once: 2 skills + 2 languages = 4 rows total, not 4 combined.
INSERT INTO PersonSkill (person, skill) VALUES
    ('Alice', 'SQL'), ('Alice', 'Python');
INSERT INTO PersonLanguage (person, language) VALUES
    ('Alice', 'English'), ('Alice', 'Spanish');

Splitting on each independent MVD removes the Cartesian product: adding a language now inserts a single row and cannot create inconsistent combinations.

Follow-up questions to expect

How is a multi valued dependency different from a functional dependency?
When is a multi valued dependency considered trivial?
What is fifth normal form (5NF) and how does it relate to 4NF?
How would you spot a 4NF violation when reviewing an existing schema?

15 How do you normalize a table step by step to 3NF?

Normalization is the process of splitting a wide, repetitive table into focused tables so that every fact is stored in exactly one place. You reach third normal form (3NF) by applying three rules in order, checking each one before moving to the next. The best way to answer this in an interview is to walk through a concrete example. See the database normalization guide for the full reference.

Imagine one flat Orders table that records, on every line, the order id, the customer name and city, the product name and unit price, and the quantity. Because a single order has several products, the customer details repeat on every line, and the product price is copied into every order that sells that product. That redundancy is what causes update, insertion and deletion anomalies.

First normal form (1NF): atomic values and a key. Ensure each column holds a single value (no comma separated lists of products in one cell) and that the table has a key that identifies each row. Here one order line is identified by the composite key (order_id, product_id), since an order contains many products and a product appears in many orders.

Second normal form (2NF): remove partial dependencies. A table is in 2NF when it is in 1NF and no non-key column depends on only part of a composite key. The customer name and city depend on order_id alone, not on the product, and the product name and price depend on product_id alone. Move each of those into its own table so the line item keeps only what truly depends on the whole key: the quantity.

Third normal form (3NF): remove transitive dependencies. A table is in 3NF when it is in 2NF and no non-key column depends on another non-key column. If the customer table still held the city and a derived region, where region depends on city rather than on the customer key, that transitive dependency would be split out too. Keep pushing attributes to the table whose key they actually describe.

The end state is four clean tables: Customers, Products, Orders (one row per order, referencing the customer), and OrderItems (one row per product on an order, referencing both). Each fact now lives once, and the tables are wired together with foreign keys pointing at primary keys. For more depth and trade offs, read database normalization explained.

Before: one wide, repetitive Orders table

-- Customer and product facts repeat on every line
SELECT order_id, customer_name, customer_city,
       product_name, unit_price, quantity
FROM WideOrders;

-- order_id customer_name customer_city product_name unit_price quantity
-- 1001     Ada Lovelace  London        Keyboard     40.00      2
-- 1001     Ada Lovelace  London        Mouse        18.00      1
-- 1002     Alan Turing   Manchester    Keyboard     40.00      3

Ada and her city repeat per line, and the keyboard price is copied everywhere it sells. Changing that price means many updates.

After: normalized tables to 3NF

CREATE TABLE Customers (
    customer_id INT PRIMARY KEY,
    customer_name NVARCHAR(100) NOT NULL,
    customer_city NVARCHAR(60) NOT NULL
);

CREATE TABLE Products (
    product_id INT PRIMARY KEY,
    product_name NVARCHAR(100) NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL
);

CREATE TABLE Orders (
    order_id INT PRIMARY KEY,
    customer_id INT NOT NULL
        REFERENCES Customers(customer_id),
    order_date DATE NOT NULL
);

CREATE TABLE OrderItems (
    order_id INT NOT NULL
        REFERENCES Orders(order_id),
    product_id INT NOT NULL
        REFERENCES Products(product_id),
    quantity INT NOT NULL,
    CONSTRAINT PK_OrderItems
        PRIMARY KEY (order_id, product_id)
);

Each fact lives once: customer details in Customers, price in Products, and OrderItems holds only quantity under the whole composite key.

Reassemble the original view with joins

SELECT o.order_id, c.customer_name, c.customer_city,
       p.product_name, p.unit_price, oi.quantity
FROM Orders AS o
JOIN Customers AS c  ON c.customer_id = o.customer_id
JOIN OrderItems AS oi ON oi.order_id = o.order_id
JOIN Products AS p    ON p.product_id = oi.product_id;

Normalization does not lose information: a join rebuilds the wide row on demand from the single source of each fact.

Follow-up questions to expect

What anomalies does normalization prevent, and can you give an example of each?
When would you deliberately denormalize this design for performance?
What is the difference between 3NF and BCNF?
How do the foreign keys enforce that an order item points at a real order and product?

Keep going

Explore the other interview categories, or drill the fundamentals with the functions library and the SQL Server error library.

Get 1-on-1 Prep All Categories