Design questions test how you model real problems: relationships, keys, data types and ER diagrams. Interviewers want to see structured thinking. Each answer includes an ER-style diagram and the SQL that implements the design.
A one to many relationship means one row in the parent (the "one" side) can be linked to many rows in the child (the "many" side), while each child row links back to exactly one parent. The classic example is Customers 1 - N Orders: a single customer can place many orders, but each order belongs to just one customer.
You model this by placing a foreign key on the many side. The child table (Orders) gets a customer_id column that references the primary key of the parent table (Customers). You never store the relationship on the one side, because a single column cannot hold a list of child rows; the pointer always lives on the child.
The foreign key does more than document intent, it is enforced. SQL Server rejects any insert or update that sets Orders.customer_id to a value that does not exist in Customers, so you can never end up with an orphaned order that points at a customer who was never created. Trying to insert such a row raises error 547, the foreign key conflict.
customer_id to NULL, which requires the column to be nullable.In practice you almost always want the default NO ACTION for records like orders, so financial history is never silently destroyed. CASCADE fits owned detail rows (like order line items) where a child has no meaning without its parent.
CREATE TABLE Customers (
customer_id INT IDENTITY(1,1) PRIMARY KEY,
name NVARCHAR(100) NOT NULL,
email NVARCHAR(255) NOT NULL
);
customer_id is the primary key that the child table will point back to.
CREATE TABLE Orders (
order_id INT IDENTITY(1,1) PRIMARY KEY,
customer_id INT NOT NULL,
order_date DATE NOT NULL DEFAULT SYSUTCDATETIME(),
total DECIMAL(10,2) NOT NULL,
CONSTRAINT FK_Orders_Customers FOREIGN KEY (customer_id)
REFERENCES Customers (customer_id)
ON DELETE NO ACTION
);
customer_id is NOT NULL so every order must name a customer, and the FK guarantees that customer exists.
Related: What is a foreign key and what does it enforce? · How do you model a many to many relationship? · When and how do you model a one to one relationship?
A many to many (M:N) relationship means a row on each side can relate to many rows on the other. A student can enroll in many courses, and a course can have many students. A relational table cannot express this directly: you would have to stuff a list of course ids into a single Students column, which breaks first normal form and makes joins, constraints, and indexing impossible.
The standard solution is to add a third table, called a junction table (also known as a bridge, link, or associative table). It sits between the two entities and holds one foreign key pointing at the primary key of each side. Every row in the junction represents a single pairing, such as one student in one course.
student_id and course_id.(student_id, course_id) is declared as the primary key, which guarantees the same student cannot be enrolled in the same course twice.A key benefit is that the junction table is a natural home for attributes of the relationship itself. An enrollment is not just a link, it has its own facts: a grade and an enrolled_date belong on the junction row because they describe the pairing, not the student or the course alone.
So Students and Courses each stay lean, and an Enrollments table in the middle records who is in what, when they joined, and how they did. Modeling M:N this way is one of the most common data design tasks an interviewer will probe, so be ready to name the junction table, its two foreign keys, and its composite key.
CREATE TABLE Students (
student_id INT PRIMARY KEY,
full_name NVARCHAR(100) NOT NULL
);
CREATE TABLE Courses (
course_id INT PRIMARY KEY,
title NVARCHAR(100) NOT NULL
);
Each side is a plain table with its own primary key and no reference to the other.
CREATE TABLE Enrollments (
student_id INT NOT NULL,
course_id INT NOT NULL,
enrolled_date DATE NOT NULL DEFAULT (GETDATE()),
grade CHAR(2) NULL,
CONSTRAINT PK_Enrollments PRIMARY KEY (student_id, course_id),
CONSTRAINT FK_Enrollments_Student
FOREIGN KEY (student_id) REFERENCES Students (student_id),
CONSTRAINT FK_Enrollments_Course
FOREIGN KEY (course_id) REFERENCES Courses (course_id)
);
The composite primary key (student_id, course_id) blocks duplicate enrollments, and grade and enrolled_date describe the pairing itself.
Related: What is a junction table and when do you need one? · How do you model a one to many relationship? · What is a composite key and when should you use one?
A one to one relationship means a row in one table matches at most one row in another, and vice versa. You model it by having both tables share the same key: the child table stores the parent key as its own primary key, and that same column is also a foreign key pointing back at the parent. Because the child key is unique (it is the PK) and references the parent, each parent row can have at most one child row.
The key design choice is the UNIQUE constraint. In a one to many relationship the foreign key column allows duplicates; in a one to one you enforce uniqueness by making that foreign key column the primary key of the child table. That single decision is what turns "many" into "one".
When is splitting worth it? The common reasons are:
The alternative is almost always to keep everything in one table. A genuine one to one split adds a join to every query that needs both halves, so it is the less common relationship type. Reach for it only when a concrete driver above applies; if you cannot name one, prefer the single table. When you do split, define the child key as PRIMARY KEY and FOREIGN KEY together, and cascade deletes if the child cannot exist without its parent.
CREATE TABLE Users (
user_id INT NOT NULL IDENTITY(1,1),
email VARCHAR(255) NOT NULL,
CONSTRAINT PK_Users PRIMARY KEY (user_id),
CONSTRAINT UQ_Users_email UNIQUE (email)
);
CREATE TABLE UserProfiles (
user_id INT NOT NULL,
ssn CHAR(11) NULL,
bio NVARCHAR(MAX) NULL,
-- user_id is BOTH the primary key and a foreign key,
-- so each user has at most one profile row.
CONSTRAINT PK_UserProfiles PRIMARY KEY (user_id),
CONSTRAINT FK_UserProfiles_Users FOREIGN KEY (user_id)
REFERENCES Users (user_id) ON DELETE CASCADE
);
The child PK is also the FK, so the shared user_id can appear at most once per user, which enforces one to one.
SELECT u.user_id, u.email, p.ssn, p.bio
FROM Users AS u
LEFT JOIN UserProfiles AS p
ON p.user_id = u.user_id;
A LEFT JOIN returns every user and fills NULLs when no profile row exists yet, which is common when the child columns are optional.
Related: How do you model a one to many relationship? · How do you choose the right data type for a column? · How do you choose a good primary key?
An entity relationship (ER) diagram is a picture of a database design. It captures three things: entities (the things you store data about, which usually become tables), attributes (the properties of each entity, which become columns) and relationships (how entities connect to each other). Drawing this before writing any SQL lets a team agree on the structure and spot problems while they are still cheap to fix.
Each relationship carries a cardinality: how many rows on one side can relate to rows on the other. This is shown with crow's foot notation, where the symbols at the end of a line mean "one" (a single bar) or "many" (the splayed three-pronged foot). The common shapes are one-to-one, one-to-many and many-to-many.
An ER diagram maps directly onto a real relational schema:
CREATE TABLE.Because it doubles as a communication tool and a build plan, the ER diagram is where design decisions like normalization are made visible. A well drawn diagram tells you exactly which tables, keys and foreign keys to create. You can read a fuller walkthrough in Entity Relationship Diagrams (ERD) Explained.
CREATE TABLE Customers (
customer_id INT PRIMARY KEY,
name NVARCHAR(100) NOT NULL
);
CREATE TABLE Orders (
order_id INT PRIMARY KEY,
customer_id INT NOT NULL,
total DECIMAL(10,2) NOT NULL,
CONSTRAINT FK_Orders_Customers
FOREIGN KEY (customer_id) REFERENCES Customers(customer_id)
);
The foreign key lives on the many side (Orders) and points at the primary key on the one side (Customers).
CREATE TABLE Students (
student_id INT PRIMARY KEY,
name NVARCHAR(100) NOT NULL
);
CREATE TABLE Courses (
course_id INT PRIMARY KEY,
title NVARCHAR(100) NOT NULL
);
CREATE TABLE Enrollments (
student_id INT NOT NULL,
course_id INT NOT NULL,
CONSTRAINT PK_Enrollments PRIMARY KEY (student_id, course_id),
FOREIGN KEY (student_id) REFERENCES Students(student_id),
FOREIGN KEY (course_id) REFERENCES Courses(course_id)
);
A many-to-many relationship becomes a third table whose composite key holds a foreign key to each side.
Related: How do you model a one to many relationship? · How do you model a many to many relationship? · What is normalization and why do we do it?
Both are candidates for a table's primary key, but they come from opposite places. A natural key is a column (or set of columns) drawn from the real world that already uniquely identifies a row, for example an email address, an ISBN, a country code or a Social Security number. A surrogate key is an artificial value the system generates purely to identify the row, such as an IDENTITY column, a SEQUENCE value or a GUID. It carries no business meaning at all.
The core trade-off is meaning versus stability. A natural key is readable and self-explanatory, and it enforces a genuine business rule for free. Its weaknesses are that business data can change (people change email or surname, a product gets re-coded) and it is often wide or composite, which makes it a bulky value to copy into every foreign key and index.
A surrogate key is stable because nothing in the business ever forces it to change, and it is narrow and fast: a 4-byte INT makes a compact clustered index and a cheap foreign key. Its downside is that it is meaningless, so a join is always required to make sense of the row, and by itself it does not stop duplicate business data.
The common professional pattern is to use both: a surrogate key as the primary key that foreign keys reference, plus a separate UNIQUE constraint on the natural key so the database still rejects duplicate real-world values. Skipping that unique constraint is the classic mistake, because a surrogate key alone lets you insert the same customer twice.
CREATE TABLE Customers (
CustomerId INT IDENTITY(1,1) PRIMARY KEY,
Email VARCHAR(255) NOT NULL,
FullName VARCHAR(120) NOT NULL,
CONSTRAINT UQ_Customers_Email UNIQUE (Email)
);
CustomerId is meaningless and stable, so foreign keys point at it; the UNIQUE constraint on Email still blocks duplicate customers.
CREATE TABLE Countries (
CountryCode CHAR(2) PRIMARY KEY, -- ISO code, e.g. 'US', 'GB'
CountryName VARCHAR(80) NOT NULL
);
A small, truly stable natural key like an ISO country code is a good case for skipping a surrogate.
Related: How do you choose a good primary key? · What is the difference between a candidate key and a primary key? · What is the difference between a primary key and a unique key?
A primary key uniquely identifies every row in a table and is referenced by foreign keys in other tables, so the choice ripples across the whole schema. Judge every candidate against four hard criteria and one strong preference.
INT or BIGINT. The key is copied into every foreign key and every nonclustered index, so a wide key inflates storage and slows joins and lookups.Those criteria push you toward a surrogate key: a system generated, meaningless integer produced by an IDENTITY column or a SEQUENCE. It is narrow, guaranteed unique, and never has a business reason to change. A natural key uses real business data such as an email address or product code. It can work when a genuinely stable and compact column already exists, but business values tend to change and are often wider than an integer, which is why surrogate keys are the common default.
A GUID (UNIQUEIDENTIFIER) is another surrogate option. Its advantage is that it is globally unique, so you can generate keys on many clients or servers without collisions, which suits distributed systems and merge replication. The trade offs are that it is wide (16 bytes versus 4 for an INT) and, when generated randomly with NEWID(), its values arrive out of order and cause heavy fragmentation of a clustered index and frequent page splits. If a GUID must be the clustered key, use sequential values from NEWSEQUENTIALID() so inserts append in order.
Remember that the primary key and the clustered index are related but separate choices. By default SQL Server makes the primary key the clustered index, and the clustering key is stored in every nonclustered index, so a narrow ever increasing key keeps those indexes small and inserts cheap. See SQL indexes for how the clustered index shapes physical row order, and primary key vs foreign key for how keys tie tables together.
CREATE TABLE Customers (
customer_id INT IDENTITY(1,1) NOT NULL,
email NVARCHAR(256) NOT NULL,
full_name NVARCHAR(100) NOT NULL,
CONSTRAINT PK_Customers PRIMARY KEY (customer_id),
CONSTRAINT UQ_Customers_email UNIQUE (email)
);
The narrow, meaningless customer_id is the primary key; the natural email value gets a UNIQUE constraint instead of being the key so it can change without breaking foreign keys.
CREATE TABLE Devices (
device_id UNIQUEIDENTIFIER NOT NULL
CONSTRAINT DF_Devices_id DEFAULT NEWSEQUENTIALID(),
serial_no NVARCHAR(64) NOT NULL,
CONSTRAINT PK_Devices PRIMARY KEY CLUSTERED (device_id)
);
NEWSEQUENTIALID() hands out increasing GUIDs so inserts append at the end of the clustered index, avoiding the page splits and fragmentation a random NEWID() would cause.
Related: What is the difference between a surrogate key and a natural key? · What is a primary key? · What is a composite key and when should you use one?
A composite key (also called a compound key) is a primary key or unique key built from two or more columns. No single column on its own is unique, but the combination of columns is guaranteed unique, so together they identify each row. You declare it by listing the columns in one PRIMARY KEY or UNIQUE constraint rather than marking a single column.
The classic fit is a junction table that resolves a many-to-many relationship. In an Enrollments table, a student can take many courses and a course has many students, so neither student_id nor course_id is unique by itself. The pair (student_id, course_id) is the natural identity of an enrollment, and making it the primary key also stops the same student being enrolled in the same course twice.
Column order matters. A composite PRIMARY KEY is backed by an index, and that index is ordered by the first column, then the second, and so on (like the alphabetical order of a phone book by last name then first name). A query that filters on student_id can seek that index, but a query that filters only on course_id cannot, because the second column is not usefully sorted on its own. Put the column you filter by most, or the more selective one, first, and add a separate index for the other access pattern if you need it.
INT IDENTITY or GUID) when the natural key is wide, changes over time, or is referenced by many other tables. A slim single-column key makes foreign keys, joins and application code simpler, and you can still add a UNIQUE constraint on the natural columns to keep the rule enforced.In short, reach for a composite key when a combination of columns is the true identity of the row; reach for a surrogate key when you want a small, stable handle that is easy for the rest of the schema to point at.
CREATE TABLE Enrollments (
student_id INT NOT NULL,
course_id INT NOT NULL,
enrolled_on DATE NOT NULL DEFAULT SYSUTCDATETIME(),
CONSTRAINT PK_Enrollments PRIMARY KEY (student_id, course_id),
CONSTRAINT FK_Enroll_Student FOREIGN KEY (student_id) REFERENCES Students(student_id),
CONSTRAINT FK_Enroll_Course FOREIGN KEY (course_id) REFERENCES Courses(course_id)
);
The pair (student_id, course_id) is unique, so a student cannot enrol in the same course twice. The key columns are also the two foreign keys.
-- Seeks the PK index (leads with student_id):
SELECT course_id FROM Enrollments WHERE student_id = 42;
-- Cannot seek the PK index efficiently (course_id is the 2nd column):
SELECT student_id FROM Enrollments WHERE course_id = 7;
-- Add a supporting index for the reverse access pattern:
CREATE INDEX IX_Enrollments_Course ON Enrollments (course_id, student_id);
The composite PK is ordered by student_id first, so filtering only on course_id needs its own index.
Related: How do you model a many to many relationship? · Why does the column order in a composite index matter? · How do you choose a good primary key?
An org chart is a tree: each employee reports to exactly one manager, and a manager can have many reports. The default and most common way to store this is the adjacency list, where the table holds a parent_id column that is a foreign key pointing back at the same table primary key. One extra column captures the entire hierarchy, no matter how deep.
The adjacency list is trivial to write to: inserting an employee or moving them under a new manager is a single-row change. The trade-off is reads. To answer "who are all the reports beneath this manager?" you must walk an unknown number of levels, which is where a recursive CTE comes in. A recursive CTE has an anchor member (the starting row) and a recursive member that repeatedly joins back to the CTE until no more rows are found.
When you point a foreign key at its own table you should think carefully about the ON DELETE rule. SQL Server refuses a self-referencing ON DELETE CASCADE because it could form a cycle, raising error 1785. Leave the action as NO ACTION and reparent or delete children explicitly.
Other models exist and are worth naming to show breadth:
/1/4/9/. Subtree reads become a fast LIKE prefix scan, but the path must be rewritten whenever a node moves and referential integrity is weaker.GetAncestor() and IsDescendantOf(). It is convenient but SQL Server specific.For most applications the guidance is simple: start with an adjacency list and a recursive CTE, and only reach for nested sets or hierarchyid when read patterns on very deep or very wide trees prove it necessary.
CREATE TABLE Employees (
emp_id INT NOT NULL PRIMARY KEY,
name NVARCHAR(100) NOT NULL,
manager_id INT NULL,
CONSTRAINT FK_Employees_Manager
FOREIGN KEY (manager_id) REFERENCES Employees (emp_id)
);
manager_id points back at emp_id in the same table. The CEO row has manager_id = NULL. Do not use ON DELETE CASCADE here (error 1785).
WITH OrgChart AS (
-- anchor: the starting manager
SELECT emp_id, name, manager_id, 0 AS depth
FROM Employees
WHERE emp_id = 1
UNION ALL
-- recursive: everyone who reports to a row already found
SELECT e.emp_id, e.name, e.manager_id, oc.depth + 1
FROM Employees AS e
INNER JOIN OrgChart AS oc
ON e.manager_id = oc.emp_id
)
SELECT emp_id, name, depth
FROM OrgChart
ORDER BY depth, name;
The depth column shows how many levels below the starting manager each employee sits. Add OPTION (MAXRECURSION 100) to cap runaway recursion.
Related: What is a recursive CTE and when do you need one? · What is a self join and when would you use one? · How do you model a one to many relationship?
A star schema and a snowflake schema are both dimensional models used in data warehousing and OLAP (analytical) workloads. Each is organized around a central fact table that stores measurable events (sales, clicks, shipments) as numeric measures plus foreign keys, surrounded by dimension tables that describe those facts (product, customer, date, store). The difference is purely in how the dimension tables themselves are structured.
In a star schema each dimension is a single, denormalized table. A Product dimension holds the product name, category, and brand all in one row, even though category and brand repeat across many products. The layout looks like a star: the fact table in the middle with dimensions radiating out one join away. Because any query touches only the fact table and one level of dimensions, joins are few and the query optimizer has an easy job, so star schemas are fast to read and simple to understand.
A snowflake schema takes those dimensions and normalizes them into related sub-tables. The Product dimension is split so that category and brand live in their own lookup tables, referenced by foreign keys. This removes repeated text and saves storage, but a query now has to traverse several joins (fact to product, product to category, category to department) to get the same descriptive columns, which makes queries more complex and usually slower.
The choice reflects a broader split between OLTP and OLAP thinking. Transactional (OLTP) systems favor heavy normalization to avoid update anomalies when data changes constantly. Analytical (OLAP) warehouses are read-mostly and load data in controlled batches, so the classic advice is to denormalize for query speed and prefer a star schema, reaching for snowflaking only when a dimension is very large or its shared attributes genuinely need to be managed in one place. See normalization vs denormalization for the underlying trade-off.
SELECT p.category,
d.calendar_year,
SUM(f.sales_amount) AS total_sales
FROM FactSales AS f
INNER JOIN DimProduct AS p
ON p.product_key = f.product_key
INNER JOIN DimDate AS d
ON d.date_key = f.date_key
GROUP BY p.category, d.calendar_year;
category lives directly on DimProduct, so the fact table joins each dimension exactly once.
SELECT c.category_name,
d.calendar_year,
SUM(f.sales_amount) AS total_sales
FROM FactSales AS f
INNER JOIN DimProduct AS p
ON p.product_key = f.product_key
INNER JOIN DimCategory AS c
ON c.category_key = p.category_key
INNER JOIN DimDate AS d
ON d.date_key = f.date_key
GROUP BY c.category_name, d.calendar_year;
category has been normalized into DimCategory, so reaching category_name requires a second hop through DimProduct.
Related: When should a schema be normalized versus denormalized? · When is it acceptable to denormalize? · How do you model a one to many relationship?
Choosing a data type is a design decision, not a formality. The right type enforces correctness (you cannot store a name in an INT), and it controls how much space each row uses on disk and in memory. The guiding rule is simple: pick the smallest type that safely fits every value the column will ever hold. Narrower columns produce smaller rows, smaller indexes and less I/O, so more data fits on each page and queries run faster.
INT (about +/-2.1 billion) or BIGINT when you will exceed that. Use TINYINT or SMALLINT for small bounded ranges.DECIMAL(p, s) (also spelled NUMERIC). Never use FLOAT or REAL for money: they are binary floating point and cannot represent values like 0.10 exactly, so sums drift by rounding error.VARCHAR(n) for ASCII and NVARCHAR(n) when the column must hold Unicode (names, emoji, non-Latin scripts). Size n to the real maximum; avoid a blanket VARCHAR(MAX) or huge widths you never use.BIT, which SQL Server can pack several of into a single byte.DATE for a calendar day and DATETIME2 for a timestamp. Prefer DATETIME2 over the legacy DATETIME: it has a wider range, higher precision and often uses fewer bytes.Two choices deserve extra care. First, use DECIMAL, not FLOAT, for anything you add up or compare for equality, especially currency. Reserve FLOAT for scientific or approximate measurements where tiny rounding is acceptable. Second, use DATETIME2 instead of DATETIME in new designs; keep DATETIME only for compatibility with old columns.
Watch out for oversizing. Declaring VARCHAR(4000) for a column that never exceeds 50 characters wastes nothing on disk (varchar stores only what you put in it), but it inflates memory grants and the optimizer's size estimates, and it can push a row or index past size limits. Right-size the declared length to the data.
Finally, keep types consistent across a foreign key and its primary key. If Orders.customer_id is INT but Customers.customer_id is BIGINT, every join forces an implicit conversion, which can block index seeks and cause errors such as conversion error 245 or arithmetic overflow 8114. Match the type and length on both sides. For a fuller tour of the type system see SQL data types explained.
CREATE TABLE dbo.Products (
product_id INT NOT NULL PRIMARY KEY,
sku VARCHAR(20) NOT NULL,
name NVARCHAR(100) NOT NULL,
price DECIMAL(10,2) NOT NULL,
is_active BIT NOT NULL DEFAULT 1,
created_at DATETIME2(0) NOT NULL DEFAULT SYSUTCDATETIME()
);
DECIMAL keeps money exact, NVARCHAR holds Unicode names, BIT stores the flag, and DATETIME2 timestamps the row.
-- Customers.customer_id is BIGINT but Orders.customer_id is INT
SELECT c.name, o.order_id
FROM Customers AS c
JOIN Orders AS o
ON o.customer_id = c.customer_id;
-- SQL Server must CONVERT one side per row, which can
-- prevent an index seek and risk error 245 / overflow 8114.
Declare a foreign key with the same type and length as the primary key it references to avoid this.
Related: What is the difference between CHAR and VARCHAR? · How do you choose a good primary key? · When and how do you model a one to one relationship?
A junction table - also called a bridge or associative table - is the standard way to model a many to many relationship in a relational database. You need one whenever a row on each side can relate to many rows on the other side: a student takes many courses and a course has many students, or an order contains many products and a product appears in many orders.
A relational schema cannot express many to many directly, because a single foreign key column can only point at one parent row. The fix is to introduce a third table that sits between the two entities and resolves the many to many into two one to many relationships. Each parent has many junction rows, and each junction row points back to exactly one parent on each side.
student_id and course_id), each referencing the primary key of its parent table.(student_id, course_id) guarantees a given pair can only be stored once, so a student cannot be enrolled in the same course twice.Enforce referential integrity with FOREIGN KEY constraints on both columns and, ideally, an ON DELETE CASCADE so removing a parent cleans up its junction rows. If the pairing needs to be referenced by yet another table, or the natural key is wide, teams sometimes add a surrogate IDENTITY key and enforce the pair with a separate UNIQUE constraint instead.
Junction tables are a normal part of a normalized design. If you ever find yourself storing a comma separated list of ids in a single column, that is the signal you actually needed a junction table.
CREATE TABLE Enrollments (
student_id INT NOT NULL,
course_id INT NOT NULL,
CONSTRAINT PK_Enrollments PRIMARY KEY (student_id, course_id),
CONSTRAINT FK_Enroll_Student FOREIGN KEY (student_id)
REFERENCES Students (student_id),
CONSTRAINT FK_Enroll_Course FOREIGN KEY (course_id)
REFERENCES Courses (course_id)
);
The composite PRIMARY KEY on both foreign keys prevents the same student-course pair being inserted twice.
CREATE TABLE OrderLines (
order_id INT NOT NULL,
product_id INT NOT NULL,
quantity INT NOT NULL DEFAULT (1),
CONSTRAINT PK_OrderLines PRIMARY KEY (order_id, product_id),
CONSTRAINT FK_Line_Order FOREIGN KEY (order_id)
REFERENCES Orders (order_id) ON DELETE CASCADE,
CONSTRAINT FK_Line_Product FOREIGN KEY (product_id)
REFERENCES Products (product_id)
);
quantity belongs to the pairing, not to either parent, so it lives on the junction row alongside the two foreign keys.
Related: How do you model a many to many relationship? · What is a composite key and when should you use one? · How do you model a one to many relationship?
Both approaches answer the question "how do we get rid of a record?", but they trade permanence for recoverability in opposite ways. A hard delete issues a real DELETE and the row physically leaves the table. A soft delete leaves the row in place and instead sets a marker column that means "treat this as gone", so every read has to filter the deleted rows out.
The marker is usually one of two shapes. An is_deleted bit (or status value) is the simplest: it answers only yes or no. A deleted_at datetime is richer because a non-NULL value both flags the row and records when it happened, which pairs well with audit history. Many teams add deleted_by alongside it.
Why choose soft delete? It preserves history for audits and undo, keeps referential integrity intact because child rows still point at a parent that exists, and lets you answer "what did this look like last month" questions. It is the natural fit when data is legally or operationally valuable.
WHERE is_deleted = 0 (or deleted_at IS NULL) everywhere, and one forgotten filter leaks deleted data. A filtered view or row-level security helps.UNIQUE on email blocks a user from re-registering an address that belongs to a soft-deleted row. The fix is a filtered unique index that only covers live rows.When you still need a hard delete: compliance rules such as GDPR right to erasure require that personal data actually leaves the system, so a flag is not enough. A common pattern is to soft delete for day-to-day use and then run a real DELETE (or anonymization) job to purge data once a retention window closes. See also DELETE vs TRUNCATE vs DROP for how the removal itself behaves.
-- Mark the row deleted instead of removing it
UPDATE Users
SET is_deleted = 1,
deleted_at = SYSUTCDATETIME(),
deleted_by = @current_user_id
WHERE user_id = 42;
-- Normal reads must exclude soft-deleted rows
SELECT user_id, email
FROM Users
WHERE is_deleted = 0;
-- Filtered unique index: email is unique among LIVE rows only,
-- so a soft-deleted email can be re-used by a new signup.
CREATE UNIQUE INDEX UX_Users_Email_Live
ON Users (email)
WHERE is_deleted = 0;
The row survives for audit and restore, and the filtered index keeps the unique rule without blocking re-registration of a deleted address.
-- The row physically leaves the table and cannot be filtered back in
DELETE FROM Users
WHERE user_id = 42;
Use this for a genuine purge, for example honoring a GDPR erasure request after any retention window has passed.
Related: How do you model audit history or change tracking? · How do you enforce referential integrity? · What is the difference between DELETE, TRUNCATE and DROP?
Referential integrity means every value in a child column that references another table always points at a row that actually exists (or is NULL). You enforce it declaratively with a FOREIGN KEY constraint: it links a child column to the primary key (or a unique key) of the parent table, and the engine refuses any INSERT or UPDATE that would leave a child pointing at a parent that is not there.
The constraint also decides what happens to child rows when the parent row is deleted or its key is updated. You pick a referential action per direction with ON DELETE and ON UPDATE:
Always enforce this in the database, not just in application code. The database is the one gatekeeper every client, ETL job, ad hoc script and future service must pass through, so a constraint is the only rule that cannot be bypassed by a bug, a race condition, or a direct SQL edit. Application checks are best effort duplicates of a guarantee the engine can give you for free.
When a write breaks the rule SQL Server raises error 547, the constraint conflict. If you build cascading chains that could reach the same table by more than one path, defining the constraint fails with error 1785 (multiple cascade paths); the fix is to make some paths NO ACTION and handle them in a trigger or in code. See SQL constraints explained for how FKs sit alongside CHECK, UNIQUE and DEFAULT.
Finally, understand the cost of not trusting your constraints. Loading data with WITH NOCHECK (or adding a constraint with WITH NOCHECK) leaves the constraint not trusted: existing rows are never validated, and the query optimizer can no longer assume the relationship holds, so it skips optimizations that a trusted constraint would allow. Re-validate with WITH CHECK CHECK CONSTRAINT so the constraint becomes trusted again.
CREATE TABLE Orders (
order_id INT PRIMARY KEY,
customer_id INT NOT NULL,
total DECIMAL(10,2) NOT NULL,
CONSTRAINT FK_Orders_Customers
FOREIGN KEY (customer_id)
REFERENCES Customers (customer_id)
ON DELETE CASCADE
ON UPDATE NO ACTION
);
Deleting a customer automatically deletes the orders for that customer; an insert with an unknown customer_id fails with error 547.
CREATE TABLE Employees (
emp_id INT PRIMARY KEY,
name NVARCHAR(100) NOT NULL,
dept_id INT NULL,
CONSTRAINT FK_Employees_Departments
FOREIGN KEY (dept_id)
REFERENCES Departments (dept_id)
ON DELETE SET NULL
);
Deleting a department leaves its employees in place but sets their dept_id to NULL, so the column must be nullable.
Related: What is a foreign key and what does it enforce? · How do you model a one to many relationship? · What is the difference between a soft delete and a hard delete?
The decision comes down to whether the workload is write-heavy or read-heavy. Normalization is the process of splitting data into focused tables so that each fact is stored exactly once, with relationships expressed through foreign keys. Denormalization deliberately reverses some of that, folding related data back together or duplicating a column so that common reads touch fewer tables.
For OLTP (online transaction processing) systems - orders, payments, user accounts, anything that is updated constantly - the default is to normalize, usually to third normal form (3NF). When every fact lives in one place, an update, insert or delete changes a single row and there is no chance of two copies disagreeing. This is what protects you from update, insertion and deletion anomalies and keeps write integrity high. Storing a customer address once means correcting it once, not hunting down every order that copied it.
Denormalization is a controlled trade-off, not a shortcut. When you copy a value or precompute an aggregate, you take on the responsibility of keeping every copy consistent, typically through triggers, application logic, or a scheduled batch reload. That is acceptable in a data warehouse that is loaded in controlled batches and rarely updated in place, which is exactly why analytical models such as a star schema keep dimensions denormalized for speed.
The professional discipline is to measure before you denormalize. Start normalized, look at the actual execution plans and slow queries, and only introduce redundancy where profiling proves that joins are the real bottleneck and indexing cannot fix it. Denormalizing on a hunch trades away integrity for a performance gain you never confirmed. For the underlying normal forms and anomalies, see the normalization questions.
SELECT o.order_id,
p.product_name,
c.category_name,
o.quantity
FROM Orders AS o
INNER JOIN Products AS p
ON p.product_id = o.product_id
INNER JOIN Categories AS c
ON c.category_id = p.category_id;
category_name is stored once in Categories; every read joins through Products to fetch it, so a rename updates a single row.
-- category_name is duplicated onto Products for fast reads
SELECT o.order_id,
p.product_name,
p.category_name,
o.quantity
FROM Orders AS o
INNER JOIN Products AS p
ON p.product_id = o.product_id;
-- Cost of the redundancy: you must keep the copy in sync
UPDATE Products
SET category_name = 'Home & Kitchen'
WHERE category_id = 7;
One fewer join per read, but renaming a category now means updating every affected Products row, not one Categories row.
Related: What is normalization and why do we do it? · When is it acceptable to denormalize? · What is the difference between a star schema and a snowflake schema?
Audit history means keeping a durable record of how a row changed over time: the previous values, the new values, the operation, and the who and when. The current table alone cannot answer that, because an UPDATE or DELETE overwrites the past. The question is really about where you capture the change and how much of that plumbing you write yourself.
There are three common designs, from most manual to most automatic:
Orders_Audit) and attach AFTER INSERT, UPDATE and DELETE triggers that copy the affected rows into it. Inside a trigger the special inserted and deleted tables give you the new and old versions, so you can store both alongside the operation, SUSER_SNAME(), and SYSUTCDATETIME(). Maximum control, but you own the code and it can slow down heavy writes.datetime2 period columns and a linked history table, and every change is versioned automatically. You then query the past with FOR SYSTEM_TIME. Little code, but you get a full row copy per version, so storage grows.The trade-offs are storage, performance, and how much you must maintain. Triggers are flexible and let you record exactly the columns and metadata you care about, but they run inside the writing transaction and add overhead. Temporal tables are clean and reliable but keep a full copy of every version. CDC is decoupled and low-impact on writers but is heavier to set up and is aimed at data movement.
Audit history is closely related to soft deletes: both are ways to avoid losing information, and a well designed audit table also keeps the correct keys and constraints so you can trace a change back to the exact row it came from. See the Microsoft docs on temporal tables for the automatic option.
-- Audit table captures the change plus who and when
CREATE TABLE Orders_Audit (
audit_id BIGINT IDENTITY PRIMARY KEY,
order_id INT NOT NULL,
old_status VARCHAR(20) NULL,
new_status VARCHAR(20) NULL,
changed_by SYSNAME NOT NULL,
changed_at DATETIME2 NOT NULL
);
GO
CREATE TRIGGER trg_Orders_Audit
ON Orders
AFTER UPDATE
AS
BEGIN
SET NOCOUNT ON;
INSERT INTO Orders_Audit
(order_id, old_status, new_status, changed_by, changed_at)
SELECT d.order_id, d.status, i.status,
SUSER_SNAME(), SYSUTCDATETIME()
FROM inserted AS i
JOIN deleted AS d ON d.order_id = i.order_id
WHERE i.status <> d.status; -- only log real changes
END;
The deleted table holds the pre-update row and inserted holds the new one, so you can store both sides of the change.
CREATE TABLE Orders (
order_id INT PRIMARY KEY,
status VARCHAR(20) NOT NULL,
total DECIMAL(10,2) NOT NULL,
valid_from DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
valid_to DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
PERIOD FOR SYSTEM_TIME (valid_from, valid_to)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.Orders_History));
-- Later: see what every order looked like at a point in time
SELECT order_id, status, total
FROM Orders
FOR SYSTEM_TIME AS OF '2026-01-01T00:00:00';
SQL Server writes prior versions to Orders_History on every change; FOR SYSTEM_TIME queries the past with no extra triggers.
Related: What is a trigger and when should you use one? · What is the difference between a soft delete and a hard delete? · How do you enforce referential integrity?
Explore the other interview categories, or drill the fundamentals with the functions library and the SQL Server error library.