Recruiters now not ask you to recite the six regular types. They need to hear you cause about information at 2 a.m. when the first shard is scorching and the CFO is responding to the stakeholders. The questions you’ll encounter on this article have been harvested from actual interviews at Google, Amazon, Stripe, Snowflake, and a handful of YC unicorns. Every reply is lengthy sufficient to sql muscle reminiscence, brief sufficient to slot in the conversational window earlier than the interviewer nods or interrupts. Use these DBMS interview questions as a guidelines, and a non-exhaustive one at that.
Metric for Segregation
I’ve categorized the questions into three classes. Every class is tailor-made to a particular expertise stage and progressively goes up. The listing comprises a mixture of theoretical questions which are requested throughout an interview, and a few hands-on additions, to maintain the pragmatics.

Newcomers
These questions are related for these nonetheless studying the ropes.
Q1. What’s a main key, and why can’t we simply use ROWID
?
A. A main secret is a logical, distinctive identifier chosen by the designer. ROWID
(or CTID, _id, and many others.) is a bodily locator maintained by the engine and might change after upkeep operations resembling VACUUM, cluster re-ordering, or shard re-balancing. Exposing a bodily pointer would break foreign-key relationships the second the storage layer reorganises pages. A main key, against this, is immutable and moveable throughout storage engines, which is strictly what referential integrity wants.
Q2. Clarify logical information independence vs bodily information independence.
A. Logical information independence means you possibly can change the logical schema (e.g., including attributes or new tables) with out rewriting software packages. Bodily information independence means you possibly can change the storage construction (e.g., indexes, file group) with out affecting the logical schema or queries.
Q3. Outline 1NF, 2NF, and 3NF in a single paragraph every, then inform me which one you’d calm down first for analytics.
A. 1NF: each column comprises atomic, indivisible values (no arrays or nested tables). 2NF: 1NF plus each non-key column is totally depending on the whole main key (no partial dependency). 3NF: 2NF plus no transitive dependency—non-key columns could not rely on different non-key columns. In star-schema analytics, we normally drop 3NF first: we fortunately duplicate the client’s phase identify within the reality desk to avoid wasting a be a part of, accepting replace anomalies for learn velocity.
This fall. What’s the distinction between a schema and an occasion in a DBMS?
A. The schema is the database’s general design (its blueprint), normally mounted and barely modified. The occasion is the precise content material of the database at a given second. The schema is secure; the occasion modifications each time information is up to date.
Q5. State the 4 ACID properties and provides a one-sentence warfare story that violates every.
A. Atomicity: a debit posts, however the credit score disappears, and the cash vanishes. Consistency: a adverse steadiness is written; the verify constraint fires, and the entire transaction rolls again. Isolation: two concurrent bookings seize the final seat; each commit, resulting in an oversold flight. Sturdiness: commit returns success, energy fails, write-ahead log is on the corrupted SSD, resulting in information loss.
Q6. What are the various kinds of information fashions in DBMS?
A. Widespread fashions embody:
- Object-oriented mannequin (objects, lessons, inheritance).
- Hierarchical mannequin (tree construction, parent-child).
- Community mannequin (data related by hyperlinks).
- Relational mannequin (tables, keys, relationships).
- Entity-Relationship mannequin (high-level conceptual).
You’ve got some expertise with Databases.
Q7. What’s a impasse in DBMS? How can or not it’s dealt with?
A. Impasse happens when two transactions every maintain a useful resource and anticipate the opposite’s useful resource, blocking ceaselessly. Options:
- Avoidance (Banker’s algorithm).
- Prevention (lock ordering, timeouts).
- Detection (wait-for graph, cycle detection).
Q8. What’s checkpointing in DBMS restoration?
A. A checkpoint is a marker the place the DBMS flushes soiled pages and logs to secure storage. Throughout crash restoration, the system can begin from the final checkpoint as a substitute of scanning all the log, making restoration sooner.
Q9. What does the optimizer actually do throughout a cost-based be a part of alternative between nested-loop, hash, and merge?
A. It estimates the cardinality of every baby, consults column statistics (commonest values, histograms), considers accessible reminiscence (work_mem), indexes, and types. If the outer aspect is tiny (after filters) and the inside aspect has a selective index, nested-loop wins. Either side are massive and unsorted, which builds an in-memory hash desk (hash be a part of). If each are already sorted (index scan or earlier kind step), merge be a part of is O(n+m) and memory-cheap. The ultimate value quantity is I/O + CPU weighted by empirical constants saved in pg_statistic or mysql.column_stats
.
Q10. Clarify phantom learn and which isolation stage prevents it.
A. Transaction A runs SELECT SUM(quantity) WHERE standing="PENDING"
twice; between runs, transaction B inserts a brand new pending row. A sees a unique whole—phantom. Solely SERIALIZABLE (or Snapshot Isolation with predicate locks) prevents phantoms; REPEATABLE READ does not (opposite to folklore in MySQL).
Superior
You’ve deleted manufacturing information and lived by way of that.
Q11. Your 2 TB desk have to be sharded. Give the actual shard-key resolution tree you’d defend to the CTO.
A. 1: Listing the highest 10 queries by frequency and by bandwidth—shard should fulfill each.
2: Select a high-cardinality, uniformly distributed column (user_id
, not country_code
).
3: Make sure the column seems in each multi-row transaction; in any other case, two-phase commit turns into inevitable.
4: verify for hot-spot danger (e.g., one celeb person) — use hash-shard + per-shard autoincrement, not range-shard.
5: Show you possibly can re-shard on-line with logical replication; current a dry-run cut-over script. Solely when all 5 packing containers are ticked do you signal the design doc.
Q12. Stroll me by way of the interior steps PostgreSQL takes from INSERT
assertion to a sturdy disk byte.
A. 1: Parser → uncooked parse tree.
2: Analyzer → question tree with sorts.
3: Planner → one-node ModifyTable plan.
4: Executor grabs a buffer pin on the goal web page, inserts the tuple, and units xmin/xmax system columns.
5: WAL report inserted into shared buffers in reminiscence.
6: COMMIT writes WAL to disk by way of XLogWrite—now crash-safe.
7: Background author later flushes soiled information pages; if the server dies earlier than that, redo restoration replays WAL. Sturdiness is assured at step 6, not step 7.
Q13. Design a bitemporal desk that retains legitimate time (when the actual fact was true in actuality) and transaction time (when the database knew it). Write the first key and the SQL to right a retroactive value change.
A. Main key: (product_id, valid_from, transaction_from). Correction is an append-only insert with a brand new transaction_from; no UPDATEs.
INSERT INTO value(product_id, value, valid_from, valid_to, transaction_from, transaction_to)
VALUES (42, 19.99, '2025-07-01', '2025-12-31', now(), '9999-12-31');
To finish the earlier incorrect assertion:
UPDATE value SET transaction_to = now()
WHERE product_id = 42 AND valid_from '2025-07-01'
AND transaction_to = '9999-12-31';
Selects now use FOR SYSTEM_TIME AS OF
and BETWEEN valid_from AND valid_to
to retrieve the right temporal slice.
Conclusion
The listing consists of an eclectic mixture of questions from hands-on to purely theoretical. What this actually means is you’re being examined on DBMS pondering, not syntax: keys and normalization, ACID and isolation anomalies, question planning, restoration and WAL, deadlocks, shard-key technique, Postgres internals, and bitemporal modeling. The purpose is to floor trade-offs, invariants, failure modes, and operational judgment.
Skip memorizing clauses. Present why main keys outlive ROWIDs, when REPEATABLE READ nonetheless leaks phantoms, why a hash be a part of beats nested loops, and the way you’d reshard with out downtime. For those who can stroll by way of these selections out loud, you’ll come throughout as an information methods engineer.
For those who actually wanna undergo all that might be requested for in an interview of a database engineer, undergo the next sources:
Login to proceed studying and luxuriate in expert-curated content material.