Index internals
B+ tree structure, clustered vs. secondary indexes, covering indexes, and the leftmost prefix rule.
Transactions and MVCC
ACID guarantees, isolation levels, undo log version chains, and ReadView mechanics.
InnoDB log system
Undo log, redo log, and binlog — what each one does, when it flushes, and how two-phase commit ties them together.
Replication
Master-slave architecture, binlog formats, sync vs. async vs. semi-sync models, and relay log design.
Index internals
MySQL performs a full table scan when no index is available, which causes repeated disk I/O for every row that does not match your predicate. InnoDB reads 16 KB per I/O by exploiting locality, but a well-designed index reduces the number of I/O operations to a logarithmic minimum.
Why B+ tree
The engine considered several data structures before settling on B+ tree:
| Structure | Problem |
|---|---|
| Hash table | Equality-only; no range queries; hash collisions grow chains |
| Sorted array | Fast reads; catastrophically slow inserts and deletes |
| Binary / AVL / Red-black tree | Depth grows with row count; data scattered across memory, violating locality |
| B-tree | Better than binary trees, but internal nodes also carry row data, so fewer keys fit per page and range scans still require multiple I/Os |
| B+ tree | Non-leaf nodes store only keys; leaves form a doubly-linked sorted list — one I/O for range start, pointer traversal for the rest |
- Non-leaf nodes hold only index keys and child pointers, so more keys fit in each 16 KB page, reducing tree height.
- Leaf nodes form a doubly-linked sorted list, supporting both ascending and descending range scans with a single seek.
- The root is typically cached in the buffer pool, so most queries touch disk only for the leaf level.
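The low tree height follows from simple arithmetic. A rough sketch, assuming an 8-byte BIGINT key, a 6-byte child pointer per internal entry, and roughly 1 KB per row in the leaves (all illustrative sizes, not InnoDB's exact on-page layout):

```python
# Back-of-envelope B+ tree capacity with a 16 KB page.
# Entry sizes are illustrative assumptions, not exact InnoDB overheads.
PAGE_SIZE = 16 * 1024
KEY_BYTES, POINTER_BYTES, ROW_BYTES = 8, 6, 1024

fanout = PAGE_SIZE // (KEY_BYTES + POINTER_BYTES)  # entries per internal page
rows_per_leaf = PAGE_SIZE // ROW_BYTES             # rows per leaf page

def capacity(levels: int) -> int:
    """Rows addressable by a B+ tree of the given height."""
    return fanout ** (levels - 1) * rows_per_leaf

print(fanout)        # 1170 keys per internal page
print(capacity(2))   # 18720 rows with a 2-level tree
print(capacity(3))   # 21902400 rows with a 3-level tree
```

Under these assumptions a 3-level tree addresses about 21 million rows, which is why a primary-key lookup on even a large table costs only 2–3 page reads.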
Clustered vs. secondary indexes
InnoDB uses a clustered index (also called the primary key index) where the leaf nodes contain the complete row data. Every table has exactly one clustered index; if you do not define a primary key, InnoDB silently creates a hidden auto-increment DB_ROW_ID column as the key.
A secondary index (non-clustered index) stores the secondary key value alongside the primary key value in its leaf nodes — not the row data itself. When you query through a secondary index, InnoDB traverses that B+ tree to find the primary key, then performs a table lookup (回表) on the clustered index tree to fetch the full row.
MyISAM uses a non-clustered model where leaf nodes store the physical disk address of each row rather than the primary key. Both primary and secondary indexes require an address-based lookup, but because the address is direct, MyISAM lookups can be faster than InnoDB’s two-step clustered lookup in read-heavy workloads.
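The two-step lookup can be sketched with plain dicts standing in for the two B+ trees; the table, column names, and data here are made up for illustration:

```python
# Toy model of InnoDB's secondary-index lookup (回表).
# Dicts stand in for B+ trees; names and rows are illustrative.

# Clustered index: primary key -> full row data in the leaf
clustered = {
    1: {"id": 1, "name": "alice", "age": 30},
    2: {"id": 2, "name": "bob", "age": 25},
}
# Secondary index on `name`: leaf stores the primary key, not the row
idx_name = {"alice": 1, "bob": 2}

def lookup_by_name(name):
    pk = idx_name[name]    # step 1: search the secondary index tree
    return clustered[pk]   # step 2: table lookup on the clustered tree

print(lookup_by_name("bob"))   # {'id': 2, 'name': 'bob', 'age': 25}
```

Step 2 is the table lookup the text describes; a covering index (below) is exactly the case where step 2 can be skipped.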
Composite indexes and the leftmost prefix rule
A composite index on (a, b, c) sorts rows first by a, then by b within each a group, then by c. The optimizer can use the index for any prefix of that sort order:
- WHERE a = ? — uses index
- WHERE a = ? AND b = ? — uses index
- WHERE a = ? AND b = ? AND c = ? — uses index
- WHERE a = ? AND c = ? — uses index for a only; c is filtered within the matching rows
- WHERE b = ? — cannot use index at all
Range predicates (>, <, BETWEEN, LIKE 'prefix%') consume a prefix position but stop the match there, so any column to the right of a range predicate cannot be resolved by the index.
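The matching rules above can be condensed into a small function; this is a simplified sketch of the optimizer's behavior, not its actual implementation:

```python
# Sketch of the leftmost-prefix rule for a composite index.
# Equality predicates keep matching; a range predicate is used
# but stops matching for all columns to its right.
def usable_prefix(index_cols, predicates):
    """predicates: dict mapping column name -> 'eq' or 'range'."""
    used = []
    for col in index_cols:
        kind = predicates.get(col)
        if kind is None:
            break            # gap in the prefix: stop matching
        used.append(col)
        if kind == "range":
            break            # columns right of a range can't be used
    return used

idx = ["a", "b", "c"]
print(usable_prefix(idx, {"a": "eq", "b": "eq", "c": "eq"}))     # ['a', 'b', 'c']
print(usable_prefix(idx, {"a": "eq", "c": "eq"}))                # ['a']
print(usable_prefix(idx, {"b": "eq"}))                           # []
print(usable_prefix(idx, {"a": "eq", "b": "range", "c": "eq"}))  # ['a', 'b']
```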
Covering index
When every column in SELECT and WHERE is present in the index, InnoDB can return results from the index leaf alone — no table lookup required. This is a covering index and is one of the highest-impact query optimizations available.
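The covering check can be sketched as a set comparison. One detail worth modeling: a secondary index entry implicitly contains the primary key columns, so they count as "available" too. Column names here are illustrative:

```python
# Sketch: a query is covered when every column it touches is available
# in the index entry. Secondary index leaves implicitly carry the
# primary key, so those columns are available as well.
def is_covering(index_cols, pk_cols, query_cols):
    available = set(index_cols) | set(pk_cols)
    return set(query_cols) <= available

idx, pk = ["name"], ["id"]
print(is_covering(idx, pk, ["id", "name"]))         # True: index-only read
print(is_covering(idx, pk, ["id", "name", "age"]))  # False: table lookup needed
```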
Transaction isolation and MVCC
InnoDB implements READ COMMITTED and REPEATABLE READ without locking reads by using Multi-Version Concurrency Control (MVCC). Read operations never block write operations and vice versa, which reduces contention and eliminates a large class of deadlocks.
ACID refresher
| Property | Guarantee | InnoDB mechanism |
|---|---|---|
| Atomicity | All operations succeed or all roll back | Undo log |
| Consistency | Database moves from one valid state to another | Constraints + atomicity |
| Isolation | Concurrent transactions see a consistent snapshot | MVCC + locks |
| Durability | Committed data survives crashes | Redo log |
Hidden columns
Every InnoDB row carries three hidden columns that MVCC depends on:
- DB_TRX_ID — the transaction ID of the last transaction that modified this row
- DB_ROLL_PTR — a pointer to the previous version of this row in the undo log
- DB_ROW_ID — an auto-increment hidden primary key (present only when no explicit primary key is defined)
Undo log and the version chain
Every time a transaction modifies a row, InnoDB writes the old column values to the undo log and sets DB_ROLL_PTR on the current row to point back to that undo record. Repeated modifications chain these undo records into a version chain, newest to oldest.
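The chain can be sketched as a tiny linked list; class and field values are illustrative, not InnoDB's actual record layout:

```python
# Toy undo version chain: each update stores the old version in an undo
# record and points roll_ptr at it. Data values are made up.
class Version:
    def __init__(self, trx_id, data, roll_ptr=None):
        self.trx_id = trx_id      # DB_TRX_ID: transaction that wrote this version
        self.data = data
        self.roll_ptr = roll_ptr  # DB_ROLL_PTR: next older version, or None

row = Version(trx_id=10, data={"balance": 100})                 # initial insert
row = Version(trx_id=20, data={"balance": 80}, roll_ptr=row)    # trx 20 updates
row = Version(trx_id=30, data={"balance": 50}, roll_ptr=row)    # trx 30 updates

# Walking the chain recovers every version, newest first:
chain = []
v = row
while v:
    chain.append(v.trx_id)
    v = v.roll_ptr
print(chain)   # [30, 20, 10]
```

MVCC readers walk exactly this chain until they hit a version their ReadView allows them to see.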
ReadView
When a transaction performs a snapshot read (SELECT without FOR UPDATE), InnoDB creates a ReadView containing four values:
| Field | Meaning |
|---|---|
| ids | List of all active (uncommitted) transaction IDs at snapshot time |
| creator_trx_id | Transaction ID of the current transaction |
| min_trx_id | Smallest ID in ids |
| max_trx_id | Next transaction ID that has not yet been allocated |
A row version whose DB_TRX_ID = X is visible to the ReadView if any of the following holds:
- X < min_trx_id — the writer committed before the snapshot was created
- X == creator_trx_id — the version was written by the current transaction
- X < max_trx_id and X is not in ids — the writer started before the snapshot but had already committed
If none of these holds, InnoDB follows DB_ROLL_PTR to the next older version and checks again.
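The visibility rules plus the chain walk fit in a few lines. This is a simplified sketch of the mechanism; field names follow the document, and the version chain is modeled as a list of (trx_id, data) pairs, newest first:

```python
# Sketch of ReadView visibility. A version chain is a list of
# (trx_id, data) tuples, newest version first. Simplified model.
def visible(trx_id, view):
    if trx_id < view["min_trx_id"]:
        return True                    # committed before the snapshot
    if trx_id == view["creator_trx_id"]:
        return True                    # our own write
    if trx_id >= view["max_trx_id"]:
        return False                   # started after the snapshot
    return trx_id not in view["ids"]   # started before, already committed

def read(chain, view):
    """Walk the version chain to the first visible version."""
    for trx_id, data in chain:
        if visible(trx_id, view):
            return data
    return None

# Snapshot taken by trx 30 while trx 20 and 30 are active:
view = {"ids": {20, 30}, "creator_trx_id": 30,
        "min_trx_id": 20, "max_trx_id": 31}

print(read([(25, "v3"), (20, "v2"), (10, "v1")], view))  # v3: 25 committed
print(read([(20, "v2"), (10, "v1")], view))              # v1: 20 still active
print(read([(30, "mine"), (10, "v1")], view))            # mine: our own write
```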
Isolation levels
- Read uncommitted — no MVCC; reads go directly to the latest row data regardless of commit status, exposing dirty reads (reading uncommitted changes that may be rolled back).
- Read committed — a fresh ReadView is created for every statement, so each read sees the latest committed data; non-repeatable reads are possible.
- Repeatable read (default) — one ReadView is created at the transaction's first read and reused for its lifetime, so repeated reads return the same snapshot.
- Serializable — reads take locks instead of relying on MVCC; transactions behave as if executed serially, at the cost of concurrency.
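The practical difference between READ COMMITTED and REPEATABLE READ is only *when* the ReadView is created. A minimal sketch, with a "ReadView" reduced to the set of commits visible at creation time:

```python
# Sketch: RC builds a new ReadView per statement; RR builds one at the
# transaction's first read and reuses it. Everything here is simplified.
committed = {10}                 # trx 10 committed before we start reading

def make_view():
    return set(committed)        # snapshot the committed set at this instant

rr_view = make_view()            # REPEATABLE READ: view fixed at first read
committed.add(20)                # trx 20 commits mid-transaction
rc_view = make_view()            # READ COMMITTED: fresh view for next statement

print(20 in rr_view)   # False: RR still sees the original snapshot
print(20 in rc_view)   # True: RC sees the newly committed change
```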
InnoDB log system
InnoDB maintains three distinct log types, each serving a different durability or replication purpose.
Undo log
The undo log records before-images — the original values of rows before a transaction modifies them. InnoDB writes undo records as the transaction executes each change, not when it commits. Two uses:
- Rollback: if a transaction aborts, InnoDB replays the undo records in reverse to restore original values.
- MVCC snapshots: concurrent readers use undo records to reconstruct older row versions without blocking writers.
Redo log
The redo log records after-images of changes to InnoDB data pages. It exists for crash recovery: if MySQL crashes between when a page is modified in the buffer pool and when that page is flushed to disk, the redo log lets InnoDB replay the modification on restart. InnoDB uses a circular buffer of fixed-size redo log files managed by two pointers:
- Write Pos — where the next redo record will be written
- Checkpoint — the oldest record not yet flushed to data files
When Write Pos catches up to Checkpoint, MySQL stalls new writes, flushes dirty pages to disk, and advances Checkpoint to reclaim space.
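The ring arithmetic is simple modular math. A sketch with an invented total capacity, just to show how free space is computed as the two pointers chase each other:

```python
# Sketch of the redo log ring. TOTAL is an illustrative capacity in bytes.
TOTAL = 4096

def free_space(write_pos, checkpoint):
    """Bytes still writable before Write Pos would overrun Checkpoint."""
    used = (write_pos - checkpoint) % TOTAL   # records not yet checkpointed
    return TOTAL - used

print(free_space(1000, 0))     # 3096: normal case, Write Pos ahead
print(free_space(100, 3500))   # 3400: Write Pos has wrapped past the end
# When free_space approaches 0, MySQL stalls new writes, flushes dirty
# pages, and advances Checkpoint to reclaim room.
```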
Redo log flushing is controlled by innodb_flush_log_at_trx_commit:
| Value | Behavior | Safety |
|---|---|---|
| 0 | Write to the log buffer only; InnoDB flushes roughly once per second | Low — up to 1 second of data loss |
| 1 (default) | fsync to disk on every commit | High — no committed data loss |
| 2 | Write to OS page cache on every commit; OS flushes | Medium — up to 1 second of loss on OS crash |
Binlog
The binlog is a Server-layer log (not storage-engine-specific) that records every data-modifying statement and DDL executed against the database. Its two primary uses are point-in-time recovery and replication. Binlog flushing is controlled by sync_binlog:
| Value | Behavior |
|---|---|
| 0 | Write to OS cache; OS decides when to flush |
| 1 (default since MySQL 5.7.7) | fsync on every transaction commit |
| N | fsync every N transactions |
Two-phase commit
Redo log and binlog are independent. If a crash occurs after one flushes but before the other, their contents diverge and replication breaks. InnoDB prevents this with two-phase commit (2PC):
- Prepare phase: InnoDB writes the redo log and marks it PREPARE.
- Commit phase: the Server writes the binlog, then InnoDB marks the redo log COMMIT.
On crash recovery, if a redo record is in the PREPARE state and a matching binlog entry exists, MySQL completes the commit. If the binlog entry is missing, MySQL rolls back.
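The recovery decision reduces to a small table of cases. A sketch of that logic (states as plain strings, simplified from the real protocol):

```python
# Sketch of the 2PC crash-recovery decision. redo_state is "NONE",
# "PREPARE", or "COMMIT"; binlog_present says whether the matching
# binlog entry reached disk before the crash.
def recover(redo_state, binlog_present):
    if redo_state == "COMMIT":
        return "commit"    # commit fully recorded before the crash
    if redo_state == "PREPARE":
        # The binlog is the tiebreaker: if replicas could have seen the
        # transaction, it must commit; otherwise it must roll back.
        return "commit" if binlog_present else "rollback"
    return "rollback"      # no prepare record: transaction never committed

print(recover("PREPARE", True))    # commit
print(recover("PREPARE", False))   # rollback
print(recover("COMMIT", True))     # commit
```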
Replication
MySQL replication uses the binlog to propagate changes from a primary (master) to one or more replicas (slaves).
How replication works
- The primary records each committed transaction to the binlog.
- A replica’s I/O thread connects to the primary, requests binlog events, and writes them to a local relay log.
- A replica’s SQL thread reads the relay log and replays each event against the replica’s storage engine.
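The three steps above can be modeled as a toy pipeline; the threads are reduced to plain function calls and the events are invented, purely for illustration:

```python
# Toy model of binlog -> relay log -> replica apply.
# Events and data are made up; real binlog events are binary records.
primary_binlog = [("INSERT", "a"), ("INSERT", "b"), ("DELETE", "a")]
relay_log = []
replica = set()

def io_thread():
    # I/O thread: pull binlog events from the primary into the relay log
    relay_log.extend(primary_binlog)

def sql_thread():
    # SQL thread: replay relay-log events in order against replica state
    for op, key in relay_log:
        if op == "INSERT":
            replica.add(key)
        elif op == "DELETE":
            replica.discard(key)

io_thread()
sql_thread()
print(replica)   # {'b'}
```

The two-stage design is why lag accumulates in the relay log: the I/O thread can keep receiving events even while the SQL thread falls behind replaying them.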
Replication models
- Async (default) — the primary commits and returns to the client immediately, without waiting for any replica to acknowledge. Highest performance, but if the primary crashes before a replica receives an event, that data is lost.
- Semi-sync — the primary waits for at least one replica to acknowledge receipt of the binlog event (not its application) before returning to the client. A compromise between safety and latency.
- Sync — the primary waits for all replicas to receive and apply the transaction before returning. Strongest consistency, highest latency.
Replication lag
Replication lag is the time difference between when the primary commits a transaction and when the replica applies it. Common causes include:
- Single-threaded SQL thread: older MySQL versions replay binlog events sequentially; a slow query on the replica stalls all subsequent events.
- Write spikes: a bulk insert on the primary generates a large burst of binlog events that the replica’s SQL thread must catch up on.
- Network bandwidth: a slow link between primary and replica limits how quickly relay log entries arrive.
Mitigations include parallel replication (slave_parallel_workers), row-based binlog format for large batch operations, and dedicated network paths between primary and replica.
Binlog formats
| Format | How it records | Use cases |
|---|---|---|
| STATEMENT | SQL statement text | Human-readable; small log size; non-deterministic functions can cause divergence |
| ROW | Before and after images of each affected row | Deterministic; larger log size; the safe choice for non-deterministic statements |
| MIXED | STATEMENT by default; switches to ROW for non-deterministic statements | Balanced default for most workloads |