Normalization questions test whether you understand why we split tables. They cover the normal forms, functional dependencies, update anomalies and the trade-offs of denormalization. Each answer shows the before-and-after tables in a diagram with SQL.
Normalization is the process of structuring a relational schema so that data is broken into focused tables and each fact is stored in exactly one place. The goal is to reduce redundancy (the same value repeated in many rows) and remove undesirable dependencies, so that every non-key column depends on the key, the whole key, and nothing but the key.
We normalize primarily to protect data integrity. When a fact is duplicated across rows, the database can drift into contradictory states. These problems are known as anomalies:
Normalization is achieved by progressively applying the normal forms. 1NF requires atomic values and no repeating groups. 2NF removes partial dependencies on part of a composite key. 3NF removes transitive dependencies, where a non-key column depends on another non-key column. BCNF is a stricter version of 3NF for edge cases with overlapping candidate keys. See the normalization guide and this worked walkthrough for the full progression.
The trade-off is read performance. A highly normalized schema spreads data across many tables, so answering a query can require several JOIN operations. For read-heavy analytics or reporting workloads teams sometimes deliberately reverse course with denormalization, accepting controlled redundancy in exchange for fewer joins and faster reads.
CREATE TABLE Orders (
order_id INT PRIMARY KEY,
order_date DATE,
customer_name VARCHAR(100),
customer_email VARCHAR(100),
customer_city VARCHAR(60)
);
-- The same customer_name, customer_email and customer_city
-- are duplicated for every order that customer places.
-- Change one email and you must update every matching row.
Customer facts are duplicated across order rows, which invites update, insert and delete anomalies.
CREATE TABLE Customers (
customer_id INT PRIMARY KEY,
customer_name VARCHAR(100),
customer_email VARCHAR(100),
customer_city VARCHAR(60)
);
CREATE TABLE Orders (
order_id INT PRIMARY KEY,
order_date DATE,
customer_id INT NOT NULL
REFERENCES Customers(customer_id)
);
Each customer is stored once in Customers; Orders points at it by customer_id, so an email change is a single-row update.
Related: What is first normal form (1NF)? · What are update, insertion and deletion anomalies? · What is the difference between normalization and denormalization?
First normal form (1NF) is the first step of database normalization. A table satisfies 1NF when two conditions hold: every column stores a single atomic value, and every row is uniquely identifiable because the table has a key.
Atomic means one value per column per row. A column that packs several values into one cell breaks 1NF, whether the values are a comma-separated list, a repeating group of columns like phone1, phone2, phone3, or an array. The moment you would need to split a cell to answer a question ("which customers have a mobile number?"), the design is not in 1NF.
The fix for a multi-valued column is to split it into separate rows in a related child table. Each value from the offending cell becomes its own row, linked back to the parent by a foreign key. This turns one crowded column into a clean one-to-many relationship that you can index, filter, and join.
1NF makes data queryable with plain SQL: you can count, filter, and join individual values instead of parsing strings. It is the foundation the higher forms (2NF and beyond) build on.
CREATE TABLE Customers (
customer_id INT PRIMARY KEY,
name NVARCHAR(100),
phones NVARCHAR(200) -- e.g. '555-0100, 555-0101, 555-0102'
);
INSERT INTO Customers (customer_id, name, phones)
VALUES (1, 'Ada Lovelace', '555-0100, 555-0101, 555-0102');
The phones column packs several numbers into one cell, so it violates 1NF. You cannot easily count or search a single number without parsing the string.
CREATE TABLE Customers (
customer_id INT PRIMARY KEY,
name NVARCHAR(100)
);
CREATE TABLE Phones (
phone_id INT PRIMARY KEY,
customer_id INT NOT NULL REFERENCES Customers(customer_id),
phone NVARCHAR(20) NOT NULL
);
INSERT INTO Phones (phone_id, customer_id, phone)
VALUES (1, 1, '555-0100'),
(2, 1, '555-0101'),
(3, 1, '555-0102');
Each phone number is now atomic and lives in its own row, linked to the customer by a foreign key. This one-to-many design is in 1NF and is easy to query and index.
Related: What is second normal form (2NF)? · What is normalization and why do we do it? · How do you model a one to many relationship?
Second normal form (2NF) is the second step in database normalization. A table is in 2NF when two things are true: it is already in first normal form (1NF), and it has no partial dependency. A partial dependency means a non-key column depends on only part of a composite primary key rather than the whole key.
The key insight is that 2NF is only relevant when the primary key is composite (made of two or more columns). If a table has a single-column primary key, there is no "part" of the key to depend on, so the table is automatically in 2NF once it reaches 1NF.
Consider an OrderItems table keyed on the pair (order_id, product_id). A column like quantity correctly depends on the whole key: the quantity is specific to one product on one order. But a column like product_name depends on product_id alone, not on order_id. That is a partial dependency, and it violates 2NF.
You fix the violation by moving the partially dependent columns into a new table keyed by the part they depend on. Here product_name moves to a Products table keyed on product_id, and OrderItems keeps only quantity, which depends on the full key. Removing partial dependencies is exactly what separates 2NF from a raw 1NF table.
-- Composite primary key is (order_id, product_id)
CREATE TABLE OrderItems (
order_id INT NOT NULL,
product_id INT NOT NULL,
product_name VARCHAR(100) NOT NULL, -- depends on product_id only
quantity INT NOT NULL, -- depends on the whole key
CONSTRAINT PK_OrderItems PRIMARY KEY (order_id, product_id)
);
product_name is a partial dependency: it is fixed by product_id and gets repeated on every order line for that product.
-- product_name now lives where it belongs, keyed by product_id alone
CREATE TABLE Products (
product_id INT NOT NULL,
product_name VARCHAR(100) NOT NULL,
CONSTRAINT PK_Products PRIMARY KEY (product_id)
);
CREATE TABLE OrderItems (
order_id INT NOT NULL,
product_id INT NOT NULL,
quantity INT NOT NULL, -- depends on the full (order_id, product_id) key
CONSTRAINT PK_OrderItems PRIMARY KEY (order_id, product_id),
CONSTRAINT FK_OrderItems_Products
FOREIGN KEY (product_id) REFERENCES Products (product_id)
);
Each product name is stored once. OrderItems keeps only columns that depend on the entire composite key.
Related: What is first normal form (1NF)? · What is a partial dependency? · What is third normal form (3NF)?
Third normal form (3NF) is the next step in database normalization after second normal form. A table is in 3NF when two things are true: it is already in 2NF, and it has no transitive dependency. A transitive dependency is when a non-key column depends on another non-key column instead of depending directly on the primary key.
The classic memory aid is that in 3NF every non-key column must depend on the key, the whole key, and nothing but the key. The "whole key" part is handled by 2NF (no partial dependency on part of a composite key). The "nothing but the key" part is what 3NF adds: a non-key column is not allowed to be determined by some other non-key column.
A simple example: an Employees table with columns emp_id (the key), dept_id, and dept_name. Here dept_name depends on dept_id, and dept_id is not the key, so emp_id -> dept_id -> dept_name is a transitive dependency. This causes update anomalies: renaming a department means updating many employee rows, and a department with no employees cannot be recorded at all.
The fix is to move the transitively dependent columns into their own table. Keep dept_id as a foreign key in Employees, and put dept_id and dept_name in a separate Departments table where dept_name now depends directly on that table primary key. A deeper walkthrough of normalization shows how each form builds on the last.
CREATE TABLE Employees (
emp_id INT PRIMARY KEY,
emp_name NVARCHAR(100) NOT NULL,
dept_id INT NOT NULL,
dept_name NVARCHAR(100) NOT NULL -- depends on dept_id, not on emp_id
);
emp_id -> dept_id -> dept_name is a transitive dependency, so dept_name is duplicated across every employee in the same department.
CREATE TABLE Departments (
dept_id INT PRIMARY KEY,
dept_name NVARCHAR(100) NOT NULL -- now depends directly on the key
);
CREATE TABLE Employees (
emp_id INT PRIMARY KEY,
emp_name NVARCHAR(100) NOT NULL,
dept_id INT NOT NULL
REFERENCES Departments(dept_id)
);
dept_name lives in Departments and is stored once; Employees keeps dept_id as a foreign key. No non-key column depends on another non-key column.
Related: What is second normal form (2NF)? · What is a transitive dependency? · What is Boyce Codd normal form (BCNF)?
Boyce Codd normal form (BCNF) is a stronger form of database normalization than third normal form (3NF). The rule is short: for every non trivial functional dependency X -> Y, the determinant X must be a superkey of the table. In other words, every determinant has to be a key.
3NF already forbids a non key column from depending on another non key column, but it makes an exception: it still allows a dependency whose right side is part of a candidate key (a prime attribute). BCNF closes that loophole. The two forms only differ for the uncommon case of a table with multiple overlapping candidate keys where some determinant is not a candidate key.
The classic example is a table (student, subject, teacher) under two rules: each teacher teaches exactly one subject, and for a given subject a student is assigned one teacher. Here teacher -> subject holds, but teacher is not a candidate key, so the table is in 3NF yet violates BCNF. That lets update anomalies creep back in: you cannot record which subject a teacher covers until a student enrols, and changing a teacher subject means editing many rows.
(student, subject, teacher) into Teaches(teacher, subject) and Enrolment(student, teacher).teacher is the key of Teaches, so teacher -> subject no longer breaks the rule.One caveat interviewers like to hear: a BCNF decomposition is always lossless, but it is not always dependency preserving, so occasionally a design stays at 3NF on purpose to keep a functional dependency enforceable with a single key. See also 3NF vs BCNF.
CREATE TABLE StudentSubjectTeacher (
student VARCHAR(50) NOT NULL,
subject VARCHAR(50) NOT NULL,
teacher VARCHAR(50) NOT NULL,
-- candidate keys: (student, subject) and (student, teacher)
CONSTRAINT PK_SST PRIMARY KEY (student, subject),
CONSTRAINT UQ_SST UNIQUE (student, teacher)
);
-- teacher -> subject holds, but teacher is not a superkey: this violates BCNF.
The dependency teacher -> subject is legal in 3NF because subject is a prime attribute, yet it still causes update anomalies.
CREATE TABLE Teaches (
teacher VARCHAR(50) NOT NULL,
subject VARCHAR(50) NOT NULL,
CONSTRAINT PK_Teaches PRIMARY KEY (teacher) -- teacher -> subject
);
CREATE TABLE Enrolment (
student VARCHAR(50) NOT NULL,
teacher VARCHAR(50) NOT NULL,
CONSTRAINT PK_Enrolment PRIMARY KEY (student, teacher),
CONSTRAINT FK_Enrolment_Teacher FOREIGN KEY (teacher)
REFERENCES Teaches(teacher)
);
Splitting on the offending determinant makes teacher the key of Teaches, so every functional dependency now has a superkey on its left side.
Related: What is third normal form (3NF)? · What is the difference between 3NF and BCNF? · What is a functional dependency?
A functional dependency, written X -> Y, states that each value of the attribute set X determines exactly one value of the attribute set Y. We say X functionally determines Y, or that Y is functionally dependent on X. If two rows agree on X they must also agree on Y. For example student_id -> student_name: given a student id there is only ever one name for it, while the reverse does not hold because two students can share a name.
The left side X is called the determinant. Determinants are exactly how keys are defined: a candidate key is a minimal set of columns that functionally determines every other column in the table. Dependencies where the determinant is not a candidate key are what normalization works to remove, because they let the same fact be stored (and updated inconsistently) in many rows.
Three shapes of dependency drive the normal forms:
Y depends on the whole of a composite key and not on any part of it. This is the target state that 2NF requires.Y depends on only part of a composite key, for example (order_id, product_id) -> product_name where product_name really depends on product_id alone. Removing partial dependencies gives second normal form.student_id -> dept_id and dept_id -> dept_name, so student_id -> dept_name holds indirectly. Removing transitive dependencies gives third normal form.Functional dependencies are a property of the meaning of the data, not of the rows that happen to be present. You cannot prove a dependency by looking at a sample; you assert it from the business rules and then design tables so that every determinant is a key. Get the dependencies right and the normal forms follow almost mechanically.
CREATE TABLE Students (
student_id INT NOT NULL PRIMARY KEY,
student_name NVARCHAR(100) NOT NULL,
dept_id INT NOT NULL
);
-- student_id -> student_name and student_id -> dept_id.
-- The determinant (student_id) is the primary key, so each
-- fact about a student is stored exactly once.
Every non-key column depends fully on the key, which is what the higher normal forms require.
CREATE TABLE OrderLines (
order_id INT NOT NULL,
product_id INT NOT NULL,
product_name NVARCHAR(100) NOT NULL,
quantity INT NOT NULL,
PRIMARY KEY (order_id, product_id)
);
-- product_name depends only on product_id, not the whole key.
-- product_id -> product_name is a partial dependency, so the
-- product name is repeated on every order line for that product.
The fix is to split product_name into a Products table keyed by product_id, which reaches 2NF.
Related: What is a transitive dependency? · What is a partial dependency? · What is normalization and why do we do it?
A transitive dependency exists when a non-key column depends on the primary key only through another non-key column, rather than directly on the key. In functional dependency terms, if A -> B and B -> C, then A -> C holds indirectly (transitively) even though C does not depend on A directly. Here A is the key and both B and C are non-key columns.
A classic example is an Orders table keyed on order_id. The order determines which employee handled it (order_id -> employee_id), and the employee determines their own name (employee_id -> employee_name). So order_id -> employee_name holds only through employee_id. The name is stored in the wrong table: it describes the employee, not the order.
order_id.employee_id, not on the key, so it is transitively dependent.Transitive dependencies are exactly what third normal form (3NF) forbids. A table is in 3NF when it is already in second normal form and every non-key column depends on the key directly, not through another non-key column. Storing employee_name in Orders causes update anomalies: if the employee changes their name you must rewrite it on every one of their orders, and an order-less employee cannot be recorded at all. See database normalization for the full progression.
The fix is to split the middle dependency into its own table. Move employee_name into an Employees table keyed on employee_id, and keep only the employee_id foreign key on Orders. Now every non-key column in each table depends on that table's key and nothing else.
CREATE TABLE Orders (
order_id INT PRIMARY KEY,
order_date DATE NOT NULL,
employee_id INT NOT NULL,
employee_name VARCHAR(100) NOT NULL
);
-- order_id -> employee_id -> employee_name
-- employee_name depends on the key only through employee_id.
employee_name is transitively dependent on order_id, so the same name is repeated across every order that employee handled.
CREATE TABLE Employees (
employee_id INT PRIMARY KEY,
employee_name VARCHAR(100) NOT NULL
);
CREATE TABLE Orders (
order_id INT PRIMARY KEY,
order_date DATE NOT NULL,
employee_id INT NOT NULL
REFERENCES Employees(employee_id)
);
The middle dependency now lives in Employees; Orders keeps only the employee_id foreign key, so every non-key column depends directly on its key.
Related: What is third normal form (3NF)? · What is a functional dependency? · What is a partial dependency?
A partial dependency exists when a non-key column is functionally determined by only part of a composite primary key rather than the whole key. In other words, you could work out that column value from just one component of the key, so it does not truly belong in that table.
The important precondition: a partial dependency is only possible when the primary key is composite (made of two or more columns). If the key is a single column, every non-key attribute depends on the whole key by definition, so no partial dependency can exist. This is why single-column-key tables that are already in First Normal Form automatically satisfy 2NF.
Consider an OrderItems table with the composite key (order_id, product_id). A column like quantity depends on the full key, that is correct. But product_name depends only on product_id, one half of the key. That is a partial dependency: the product name is repeated on every order line for that product and can be updated inconsistently.
product_name moves to a Products table keyed by product_id.Eliminating partial dependencies is precisely what Second Normal Form (2NF) does. A table is in 2NF when it is in 1NF and every non-key column depends on the whole primary key, not just a part of it. Do not confuse this with a transitive dependency (non-key column depending on another non-key column), which is what Third Normal Form removes.
CREATE TABLE OrderItems (
order_id INT NOT NULL,
product_id INT NOT NULL,
quantity INT NOT NULL,
product_name NVARCHAR(100) NOT NULL, -- depends only on product_id
CONSTRAINT PK_OrderItems PRIMARY KEY (order_id, product_id)
);
product_name depends on only part of the key (product_id), so it repeats on every order line for that product.
CREATE TABLE Products (
product_id INT NOT NULL PRIMARY KEY,
product_name NVARCHAR(100) NOT NULL
);
CREATE TABLE OrderItems (
order_id INT NOT NULL,
product_id INT NOT NULL,
quantity INT NOT NULL,
CONSTRAINT PK_OrderItems PRIMARY KEY (order_id, product_id),
CONSTRAINT FK_OrderItems_Products
FOREIGN KEY (product_id) REFERENCES Products(product_id)
);
product_name now lives once in Products; OrderItems keeps only columns that depend on the full (order_id, product_id) key.
Related: What is second normal form (2NF)? · What is a functional dependency? · What is a composite key and when should you use one?
When a single wide table stores several unrelated facts together, the same fact ends up repeated across many rows. That redundancy is what causes the three classic modification anomalies. They are the standard motivation an interviewer expects you to give for normalization: splitting one overloaded table into focused tables so each fact is stored exactly once.
Picture an Orders table that repeats the customer name, the customer city and the product price on every single order line. The data model works, but it is fragile in three specific ways.
The cure is the same in every case: normalize the design so each fact lives in one place. Customers go in a Customers table, products in a Products table, and the Orders table keeps only foreign keys that point at them. A customer city is then stored once, a product can exist with zero orders, and deleting an order cannot remove a customer.
In an interview it is worth naming which normal form each fix relates to. Repeating groups are removed by first and second normal form, while facts that depend on a non-key column are removed by third normal form. All three anomalies disappear once every non-key column depends on the key, the whole key and nothing but the key.
CREATE TABLE Orders (
order_id INT,
customer_name VARCHAR(100),
customer_city VARCHAR(100),
product_name VARCHAR(100),
unit_price DECIMAL(10,2)
);
-- Customer city is repeated on every order row.
-- Moving a customer means updating many rows (update anomaly).
UPDATE Orders
SET customer_city = 'Seattle'
WHERE customer_name = 'Ada Lovelace';
-- Miss a row and the table contradicts itself:
-- some rows say Seattle, some still say the old city.
A new product cannot be added until it is ordered (insertion anomaly), and deleting a customer last order loses their city (deletion anomaly).
CREATE TABLE Customers (
customer_id INT PRIMARY KEY,
name VARCHAR(100),
city VARCHAR(100)
);
CREATE TABLE Products (
product_id INT PRIMARY KEY,
name VARCHAR(100),
unit_price DECIMAL(10,2)
);
CREATE TABLE Orders (
order_id INT PRIMARY KEY,
customer_id INT REFERENCES Customers(customer_id),
product_id INT REFERENCES Products(product_id)
);
-- City lives in one row now: one update, no inconsistency.
UPDATE Customers SET city = 'Seattle' WHERE customer_id = 1;
A product can exist with no orders, a customer survives deletion of their orders, and each fact changes in exactly one place.
Related: What is normalization and why do we do it? · What is the difference between normalization and denormalization? · What is third normal form (3NF)?
Normalization is the process of organizing columns and tables so that each fact is stored in exactly one place. You split a wide table into smaller focused tables linked by keys, progressing through the normal forms (1NF, 2NF, 3NF and beyond). The goal is to remove redundancy and the update, insert and delete anomalies that redundancy causes. See the normalization guide for a full walkthrough.
Denormalization is the deliberate reverse step: you add controlled redundancy back into a normalized schema. That can mean duplicating a column across tables, storing a precomputed or aggregated value, or merging tables that are almost always queried together. The point is to avoid expensive joins and aggregations at read time.
The right choice depends on the workload. OLTP systems, where many concurrent transactions insert and update small amounts of data, lean toward a normalized design so writes stay cheap and consistent. OLAP and reporting systems, where large read-mostly queries scan and aggregate history, often denormalize (for example into star schemas or summary tables) so dashboards stay fast.
In interviews the key point is that these are not opposites you pick once. You normalize first to get a correct model, then denormalize selectively and only where a measured read bottleneck justifies the cost of maintaining the duplicated data.
SELECT ol.order_id, p.name AS product, ol.quantity, ol.unit_price
FROM OrderLines AS ol
INNER JOIN Products AS p
ON p.product_id = ol.product_id;
The product name lives once in Products; the order line stores only product_id and joins to read the name.
ALTER TABLE OrderLines
ADD line_total AS (quantity * unit_price) PERSISTED;
SELECT order_id, SUM(line_total) AS order_total
FROM OrderLines
GROUP BY order_id;
A PERSISTED computed column stores quantity * unit_price on disk, so read queries skip the arithmetic; SQL Server keeps it in sync automatically.
Related: What is normalization and why do we do it? · When is it acceptable to denormalize? · What are update, insertion and deletion anomalies?
Denormalization is the deliberate introduction of redundancy into a schema to make reads faster. It is acceptable, but only as an optimization you reach for after measuring, never as a default. Start from a properly normalized design, prove with query plans and timings that specific joins or aggregations are the bottleneck, and then denormalize the narrow spot that hurts.
The classic trigger is a read-heavy or reporting workload: the same expensive multi-table join or GROUP BY runs thousands of times, while writes are comparatively rare. Trading a little redundancy for far fewer joins at read time is a good deal when reads vastly outnumber writes. On a write-heavy OLTP path the trade is usually the wrong way round, because every write now has more copies to update.
Common techniques, from lightest to heaviest:
indexed view in SQL Server, a materialized view elsewhere).Every one of these has the same price: you now own the consistency problem. The copy can drift from the truth unless you keep it in sync with triggers, scheduled refresh jobs, or application logic, and each of those adds complexity and a place for bugs. An indexed view is the safest option because the engine keeps it correct automatically, at the cost of slower writes to the base tables.
The rule that keeps this sane: keep the normalized tables as the single source of truth and treat every denormalized structure as a disposable, rebuildable cache. If a summary table is ever wrong you must be able to drop it and regenerate it from the base data. For where these ideas sit in the wider picture see performance tuning.
-- Normalized source of truth stays untouched: Orders.
-- Denormalized cache: one running total per customer.
CREATE TABLE CustomerOrderTotals (
customer_id INT PRIMARY KEY,
order_count INT NOT NULL DEFAULT 0,
total_spent DECIMAL(12,2) NOT NULL DEFAULT 0
);
GO
CREATE TRIGGER trg_Orders_Rollup
ON Orders
AFTER INSERT
AS
BEGIN
SET NOCOUNT ON;
MERGE CustomerOrderTotals AS t
USING (
SELECT customer_id, COUNT(*) AS c, SUM(total) AS s
FROM inserted
GROUP BY customer_id
) AS i
ON t.customer_id = i.customer_id
WHEN MATCHED THEN
UPDATE SET order_count = t.order_count + i.c,
total_spent = t.total_spent + i.s
WHEN NOT MATCHED THEN
INSERT (customer_id, order_count, total_spent)
VALUES (i.customer_id, i.c, i.s);
END;
Reports read one indexed row per customer instead of scanning and aggregating Orders every time; the trigger pays the consistency cost on write.
CREATE VIEW dbo.SalesByCustomer
WITH SCHEMABINDING
AS
SELECT
customer_id,
COUNT_BIG(*) AS order_count,
SUM(total) AS total_spent
FROM dbo.Orders
GROUP BY customer_id;
GO
-- Materializing it: the unique clustered index persists the result.
CREATE UNIQUE CLUSTERED INDEX IX_SalesByCustomer
ON dbo.SalesByCustomer (customer_id);
SQL Server keeps the indexed view in sync automatically as base rows change, so there is no drift; the trade off is extra work on every write to Orders.
Related: What is the difference between normalization and denormalization? · What is normalization and why do we do it? · When should a schema be normalized versus denormalized?
Both Third Normal Form (3NF) and Boyce-Codd Normal Form (BCNF) build on the same goal: eliminate the update, insert, and delete anomalies that come from redundant data driven by functional dependencies. They differ only in how strict the rule about determinants is. See the wider topic on database normalization for the full progression from 1NF onward.
A table is in 3NF when it is already in 2NF and every non-key attribute depends on the key, the whole key, and nothing but the key. More precisely, for every functional dependency X -> Y, either X is a superkey, or Y is a prime attribute (that is, Y is part of some candidate key). That second escape clause is the loophole: 3NF permits a non-key column to determine part of a candidate key in rare cases where candidate keys overlap.
A table is in BCNF when, for every non-trivial functional dependency X -> Y, X is a superkey. There is no exception for prime attributes. In other words, every determinant (the left side of a dependency) must be a candidate key or a superset of one. This is why BCNF is sometimes called 3.5NF: it closes the exact gap that 3NF leaves open.
One important trade-off: 3NF can always be reached while keeping the decomposition both lossless and dependency preserving (every functional dependency can still be checked on a single table). BCNF sometimes cannot. Decomposing a stubborn table into BCNF can occasionally force you to sacrifice dependency preservation, meaning a constraint can only be enforced by joining tables back together. That is the practical reason some designers stop at 3NF.
The classic textbook example is a table like ClassSchedule(Student, Subject, Teacher) where each teacher teaches exactly one subject, and each student studies a subject with one teacher. Candidate keys are {Student, Subject} and {Student, Teacher}. The dependency Teacher -> Subject holds, but Teacher is not a superkey, so the table violates BCNF. It is still in 3NF because Subject is a prime attribute. See the dedicated pages on 3NF and BCNF for deeper walkthroughs.
-- Rules: each Teacher teaches exactly one Subject.
-- Candidate keys: (Student, Subject) and (Student, Teacher).
-- The dependency Teacher -> Subject breaks BCNF because
-- Teacher is not a superkey, yet Subject is a prime attribute
-- so 3NF is still satisfied.
CREATE TABLE ClassSchedule (
Student INT NOT NULL,
Subject VARCHAR(40) NOT NULL,
Teacher VARCHAR(40) NOT NULL,
CONSTRAINT PK_ClassSchedule PRIMARY KEY (Student, Subject)
);
Storing Teacher -> Subject here means the same fact is repeated on every row for that teacher, the redundancy BCNF removes.
-- Split so every determinant is a superkey.
CREATE TABLE TeacherSubject (
Teacher VARCHAR(40) NOT NULL,
Subject VARCHAR(40) NOT NULL,
CONSTRAINT PK_TeacherSubject PRIMARY KEY (Teacher)
);
CREATE TABLE StudentTeacher (
Student INT NOT NULL,
Teacher VARCHAR(40) NOT NULL,
CONSTRAINT PK_StudentTeacher PRIMARY KEY (Student, Teacher),
CONSTRAINT FK_StudentTeacher_Teacher
FOREIGN KEY (Teacher) REFERENCES TeacherSubject (Teacher)
);
Now Teacher is the key of its own table, so Teacher -> Subject no longer causes redundancy. Note the original (Student, Subject) uniqueness can no longer be enforced on a single table, the dependency-preservation cost of BCNF.
Related: What is third normal form (3NF)? · What is Boyce Codd normal form (BCNF)? · What is a functional dependency?
A candidate key is any minimal set of one or more columns whose values uniquely identify every row in a table. Minimal means you cannot remove any column from it and still keep uniqueness, so no column in a candidate key is redundant. A single table can legitimately have several candidate keys.
The primary key is not a different kind of key. It is simply the one candidate key you choose to be the main identifier for the table. That choice carries rules: a primary key column can never be NULL, its values must stay unique, and it is the natural target for the foreign keys that reference this table.
Once you promote one candidate key to primary key, the leftover candidate keys do not lose their meaning. They are called alternate keys, and you keep them enforced by adding a UNIQUE constraint on each. That way the database still guarantees uniqueness for those columns even though they are not the chosen identifier.
It also helps to separate a candidate key from a superkey. A superkey is any set of columns that uniquely identifies a row, even if it carries extra columns you do not need. A candidate key is a superkey with the redundancy stripped out - it is minimal. Every candidate key is a superkey, but not every superkey is a candidate key.
NOT NULL.UNIQUE.Picking the primary key is a design decision. Interviewers often probe whether you would choose a natural candidate key or add a surrogate one, so it is worth being able to weigh surrogate versus natural keys and explain the tradeoff.
CREATE TABLE Employees (
emp_id INT NOT NULL,
national_id CHAR(11) NOT NULL,
email VARCHAR(255) NOT NULL,
full_name VARCHAR(120) NOT NULL,
CONSTRAINT PK_Employees PRIMARY KEY (emp_id),
CONSTRAINT AK_Employees_NationalId UNIQUE (national_id),
CONSTRAINT AK_Employees_Email UNIQUE (email)
);
emp_id, national_id and email are all candidate keys. emp_id is chosen as the primary key; the other two stay unique as alternate keys.
CREATE TABLE Enrollment (
student_id INT NOT NULL,
course_id INT NOT NULL,
term_code CHAR(6) NOT NULL,
grade CHAR(2) NULL,
CONSTRAINT PK_Enrollment PRIMARY KEY (student_id, course_id, term_code)
);
The minimal set (student_id, course_id, term_code) is a candidate key; dropping any one column would let duplicate enrollments slip in.
Related: What is a primary key? · What is the difference between a surrogate key and a natural key? · What is a functional dependency?
Fourth normal form (4NF) is the next step of database normalization above Boyce Codd normal form (BCNF). A table is in 4NF when two conditions hold: it is already in BCNF, and it has no non trivial multi valued dependency (MVD). Where BCNF is about functional dependencies, 4NF is about multi valued ones.
A multi valued dependency happens when one key independently determines two separate multi valued facts. Say a person can have many skills and can speak many languages, and the two lists have nothing to do with each other. If you cram both into one table keyed on the person, the table is forced to store every combination of skill and language just to represent them, because there is no way to say which skill goes with which language: they are independent.
The fix is to split the offending table into two, one per independent multi valued fact, each keyed on the person. PersonSkill(person, skill) holds the skills, PersonLanguage(person, language) holds the languages, and the Cartesian blow up disappears. Each table now stores a fact once, and you can add a skill or a language without touching the other list.
In practice a non trivial MVD almost always shows up as a many to many relationship that has been squeezed into the wrong table. If a table pairs two unrelated multi valued attributes on the same key, it violates 4NF and should be decomposed.
CREATE TABLE PersonSkillLanguage (
person VARCHAR(50) NOT NULL,
skill VARCHAR(50) NOT NULL,
language VARCHAR(50) NOT NULL,
CONSTRAINT PK_PSL PRIMARY KEY (person, skill, language)
);
-- Alice has skills {SQL, Python} and languages {English, Spanish}.
-- Because skill and language are independent, every pair must be stored:
INSERT INTO PersonSkillLanguage (person, skill, language) VALUES
('Alice', 'SQL', 'English'),
('Alice', 'SQL', 'Spanish'),
('Alice', 'Python', 'English'),
('Alice', 'Python', 'Spanish');
-- 2 skills x 2 languages = 4 rows. Add one language and you insert 2 more rows.
The person -> skill and person -> language multi valued dependencies are independent, so the table stores their Cartesian product and repeats everything.
CREATE TABLE PersonSkill (
person VARCHAR(50) NOT NULL,
skill VARCHAR(50) NOT NULL,
CONSTRAINT PK_PersonSkill PRIMARY KEY (person, skill)
);
CREATE TABLE PersonLanguage (
person VARCHAR(50) NOT NULL,
language VARCHAR(50) NOT NULL,
CONSTRAINT PK_PersonLanguage PRIMARY KEY (person, language)
);
-- Each fact is stored once: 2 skills + 2 languages = 4 rows total, not 4 combined.
INSERT INTO PersonSkill (person, skill) VALUES
('Alice', 'SQL'), ('Alice', 'Python');
INSERT INTO PersonLanguage (person, language) VALUES
('Alice', 'English'), ('Alice', 'Spanish');
Splitting on each independent MVD removes the Cartesian product: adding a language now inserts a single row and cannot create inconsistent combinations.
Related: What is Boyce Codd normal form (BCNF)? · What is third normal form (3NF)? · How do you model a many to many relationship?
Normalization is the process of splitting a wide, repetitive table into focused tables so that every fact is stored in exactly one place. You reach third normal form (3NF) by applying three rules in order, checking each one before moving to the next. The best way to answer this in an interview is to walk through a concrete example. See the database normalization guide for the full reference.
Imagine one flat Orders table that records, on every line, the order id, the customer name and city, the product name and unit price, and the quantity. Because a single order has several products, the customer details repeat on every line, and the product price is copied into every order that sells that product. That redundancy is what causes update, insertion and deletion anomalies.
First normal form (1NF): atomic values and a key. Ensure each column holds a single value (no comma separated lists of products in one cell) and that the table has a key that identifies each row. Here one order line is identified by the composite key (order_id, product_id), since an order contains many products and a product appears in many orders.
Second normal form (2NF): remove partial dependencies. A table is in 2NF when it is in 1NF and no non-key column depends on only part of a composite key. The customer name and city depend on order_id alone, not on the product, and the product name and price depend on product_id alone. Move each of those into its own table so the line item keeps only what truly depends on the whole key: the quantity.
Third normal form (3NF): remove transitive dependencies. A table is in 3NF when it is in 2NF and no non-key column depends on another non-key column. If the customer table still held the city and a derived region, where region depends on city rather than on the customer key, that transitive dependency would be split out too. Keep pushing attributes to the table whose key they actually describe.
The end state is four clean tables: Customers, Products, Orders (one row per order, referencing the customer), and OrderItems (one row per product on an order, referencing both). Each fact now lives once, and the tables are wired together with foreign keys pointing at primary keys. For more depth and trade offs, read database normalization explained.
-- Customer and product facts repeat on every line
SELECT order_id, customer_name, customer_city,
product_name, unit_price, quantity
FROM WideOrders;
-- order_id customer_name customer_city product_name unit_price quantity
-- 1001 Ada Lovelace London Keyboard 40.00 2
-- 1001 Ada Lovelace London Mouse 18.00 1
-- 1002 Alan Turing Manchester Keyboard 40.00 3
Ada and her city repeat per line, and the keyboard price is copied everywhere it sells. Changing that price means many updates.
CREATE TABLE Customers (
customer_id INT PRIMARY KEY,
customer_name NVARCHAR(100) NOT NULL,
customer_city NVARCHAR(60) NOT NULL
);
CREATE TABLE Products (
product_id INT PRIMARY KEY,
product_name NVARCHAR(100) NOT NULL,
unit_price DECIMAL(10,2) NOT NULL
);
CREATE TABLE Orders (
order_id INT PRIMARY KEY,
customer_id INT NOT NULL
REFERENCES Customers(customer_id),
order_date DATE NOT NULL
);
CREATE TABLE OrderItems (
order_id INT NOT NULL
REFERENCES Orders(order_id),
product_id INT NOT NULL
REFERENCES Products(product_id),
quantity INT NOT NULL,
CONSTRAINT PK_OrderItems
PRIMARY KEY (order_id, product_id)
);
Each fact lives once: customer details in Customers, price in Products, and OrderItems holds only quantity under the whole composite key.
SELECT o.order_id, c.customer_name, c.customer_city,
p.product_name, p.unit_price, oi.quantity
FROM Orders AS o
JOIN Customers AS c ON c.customer_id = o.customer_id
JOIN OrderItems AS oi ON oi.order_id = o.order_id
JOIN Products AS p ON p.product_id = oi.product_id;
Normalization does not lose information: a join rebuilds the wide row on demand from the single source of each fact.
Related: What is normalization and why do we do it? · What is third normal form (3NF)? · How do you model a one to many relationship?
Explore the other interview categories, or drill the fundamentals with the functions library and the SQL Server error library.