MySQL powers a large share of production backends, and understanding its internals lets you write faster queries, design better schemas, and operate your databases with confidence. This page covers the four areas that matter most in practice: how indexes work at the storage level, how InnoDB implements transaction isolation without locking reads, how the three-log system guarantees durability, and how replication propagates changes across nodes.

Index internals

B+ tree structure, clustered vs. secondary indexes, covering indexes, and the leftmost prefix rule.

Transactions and MVCC

ACID guarantees, isolation levels, undo log version chains, and ReadView mechanics.

InnoDB log system

Undo log, redo log, and binlog — what each one does, when it flushes, and how two-phase commit ties them together.

Replication

Master-slave architecture, binlog formats, sync vs. async vs. semi-sync models, and relay log design.

Index internals

MySQL falls back to a full table scan when no index is available, incurring disk I/O for every page in the table regardless of how few rows match your predicate. InnoDB reads one 16 KB page per I/O to exploit locality, and a well-designed index reduces the number of I/O operations to roughly the height of the tree (logarithmic in the row count).

Why B+ tree

The engine considered several data structures before settling on B+ tree:
  • Hash table — equality lookups only; no range queries; hash collisions grow chains
  • Sorted array — fast reads, but catastrophically slow inserts and deletes
  • Binary / AVL / red-black tree — depth grows with row count; nodes scattered across memory, violating locality
  • B-tree — much shallower, but internal nodes carry row data, so range scans still require multiple I/Os
  • B+ tree — non-leaf nodes store only keys; leaves form a doubly-linked sorted list, so a range scan costs one I/O for the start plus pointer traversal for the rest
B+ tree properties that matter for MySQL:
  • Non-leaf nodes hold only index keys and child pointers, so more keys fit in each 16 KB page, reducing tree height.
  • Leaf nodes form a doubly-linked sorted list, supporting both ascending and descending range scans with a single seek.
  • The root is typically cached in the buffer pool, so most queries touch disk only for the leaf level.
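The height advantage is easy to see with back-of-the-envelope arithmetic. A minimal sketch, assuming an 8-byte BIGINT key, a 6-byte child page pointer, and ~100-byte rows (real pages also carry headers and slot directories, so treat these as order-of-magnitude estimates):

```python
# Rough fan-out arithmetic for a B+ tree on 16 KB InnoDB pages.
# Assumed sizes (illustrative): 8-byte key + 6-byte child pointer per
# non-leaf entry, ~100 bytes per row in a clustered-index leaf.
PAGE_SIZE = 16 * 1024
ENTRY_SIZE = 8 + 6          # key + child pointer in a non-leaf page
ROW_SIZE = 100              # full row stored in a clustered-index leaf

fanout = PAGE_SIZE // ENTRY_SIZE        # keys per internal page (~1170)
rows_per_leaf = PAGE_SIZE // ROW_SIZE   # rows per leaf page (~163)

# Rows addressable by trees of height 2 and 3 (root ... leaves):
rows_h2 = fanout * rows_per_leaf
rows_h3 = fanout * fanout * rows_per_leaf

print(f"fanout={fanout}, 2-level={rows_h2:,}, 3-level={rows_h3:,}")
```

A three-level tree under these assumptions addresses over 200 million rows, which is why point lookups on large tables typically cost only two or three page reads.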

Clustered vs. secondary indexes

InnoDB uses a clustered index (also called the primary key index) where the leaf nodes contain the complete row data. Every table has exactly one clustered index; if you do not define a primary key, InnoDB uses the first unique NOT NULL index instead, and failing that silently generates a hidden 6-byte DB_ROW_ID column as the key. A secondary index (non-clustered index) stores the secondary key value alongside the primary key value in its leaf nodes — not the row data itself. When you query through a secondary index, InnoDB traverses that B+ tree to find the primary key, then performs a table lookup on the clustered index tree to fetch the full row.
-- This query uses the secondary index on (age, sex), then does a table lookup
SELECT name, email FROM users WHERE age = 30 AND sex = 'M';

-- This query uses a covering index — id is already in the leaf of the secondary index
SELECT id FROM users WHERE age = 30 AND sex = 'M';
MyISAM uses a non-clustered model where leaf nodes store the physical disk address of each row rather than the primary key. Both primary and secondary indexes require an address-based lookup, but because the address is direct, MyISAM lookups can be faster than InnoDB’s two-step clustered lookup in read-heavy workloads.
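The two-step lookup can be sketched with plain dictionaries standing in for B+ trees. The table and column names mirror the queries above; everything else is invented for illustration:

```python
# Toy model of InnoDB's two-step lookup: a secondary index on (age, sex)
# maps to primary keys; the clustered index maps primary keys to rows.

clustered = {  # primary key -> full row (clustered index leaf)
    1: {"id": 1, "name": "Ann", "email": "ann@example.com", "age": 30, "sex": "M"},
    2: {"id": 2, "name": "Bob", "email": "bob@example.com", "age": 30, "sex": "M"},
    3: {"id": 3, "name": "Eve", "email": "eve@example.com", "age": 25, "sex": "F"},
}
secondary = {  # (age, sex) -> primary keys (secondary index leaf)
    (30, "M"): [1, 2],
    (25, "F"): [3],
}

def query_name_email(age, sex):
    """SELECT name, email ...: index seek, then one table lookup per match."""
    pks = secondary.get((age, sex), [])
    return [(clustered[pk]["name"], clustered[pk]["email"]) for pk in pks]

def query_id(age, sex):
    """SELECT id ...: covering — the primary key already lives in the leaf."""
    return secondary.get((age, sex), [])

print(query_name_email(30, "M"))  # needs table lookups
print(query_id(30, "M"))          # no table lookup at all
```

The second function never touches `clustered`, which is exactly what makes a covering index cheap.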

Composite indexes and the leftmost prefix rule

A composite index on (a, b, c) sorts rows first by a, then by b within each a group, then by c. The optimizer can use the index for any prefix of that sort order:
  • WHERE a = ? — uses index
  • WHERE a = ? AND b = ? — uses index
  • WHERE a = ? AND b = ? AND c = ? — uses index
  • WHERE a = ? AND c = ? — uses index for a only; c is evaluated as a filter on each row matching a, not via the index seek
  • WHERE b = ? — cannot use index at all
Range conditions (>, <, BETWEEN, LIKE 'prefix%') consume a prefix position, so any column to the right of a range predicate cannot be resolved by the index.
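The rule above can be sketched as a small function that, given the index column order and the predicate type per column, returns how many index columns the seek can use. Names and the predicate encoding are invented for illustration:

```python
# Sketch of the leftmost-prefix rule for a composite index.
# predicates maps column -> "eq" or "range"; absent means no predicate.

def usable_prefix(index_cols, predicates):
    used = 0
    for col in index_cols:
        kind = predicates.get(col)
        if kind is None:
            break              # gap in the prefix: stop
        used += 1
        if kind == "range":
            break              # a range consumes the position; nothing after it
    return used

idx = ["a", "b", "c"]
assert usable_prefix(idx, {"a": "eq", "b": "eq", "c": "eq"}) == 3
assert usable_prefix(idx, {"a": "eq", "c": "eq"}) == 1      # b missing
assert usable_prefix(idx, {"b": "eq"}) == 0                 # no leftmost column
assert usable_prefix(idx, {"a": "eq", "b": "range", "c": "eq"}) == 2
```

This ignores refinements like index condition pushdown, which can still filter trailing columns inside the storage engine, but it captures what the seek itself can use.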

Covering index

When every column in SELECT and WHERE is present in the index, InnoDB can return results from the index leaf alone — no table lookup required. This is a covering index and is one of the highest-impact query optimizations available.
-- Covering: id, age, sex all live in the (age, sex) secondary index leaf
SELECT id FROM users WHERE age = 30;

-- Not covering: name is not in the index; requires table lookup
SELECT name FROM users WHERE age = 30;

Transaction isolation and MVCC

InnoDB implements READ COMMITTED and REPEATABLE READ without locking reads by using Multi-Version Concurrency Control (MVCC). Read operations never block write operations and vice versa, which reduces contention and eliminates a large class of deadlocks.

ACID refresher

  • Atomicity — all operations succeed or all roll back (mechanism: undo log)
  • Consistency — the database moves from one valid state to another (mechanism: constraints + atomicity)
  • Isolation — concurrent transactions see a consistent snapshot (mechanism: MVCC + locks)
  • Durability — committed data survives crashes (mechanism: redo log)

Hidden columns

Every InnoDB row carries three hidden columns that MVCC depends on:
  • DB_TRX_ID — the transaction ID of the last transaction that modified this row
  • DB_ROLL_PTR — a pointer to the previous version of this row in the undo log
  • DB_ROW_ID — an auto-increment hidden primary key (present only when no explicit primary key is defined)

Undo log and the version chain

Every time a transaction modifies a row, InnoDB writes the old column values to the undo log and sets DB_ROLL_PTR on the current row to point back to that undo record. Repeated modifications chain together:
current row  →  undo record (v3)  →  undo record (v2)  →  undo record (v1, original)
The head of the chain is always the latest value; the tail is the oldest committed version. MVCC reads walk this chain until they find a version that is visible under the current isolation level.
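The chain construction can be sketched in a few lines, with a list standing in for the undo log and an index into it standing in for DB_ROLL_PTR. All names and values are illustrative:

```python
# Toy version chain: each update stores the previous version as an undo
# record and points the current row back at it, mirroring DB_ROLL_PTR.

class Row:
    def __init__(self, value, trx_id):
        self.value, self.trx_id, self.roll_ptr = value, trx_id, None

def update(row, new_value, trx_id, undo_log):
    undo_log.append((row.value, row.trx_id, row.roll_ptr))  # before-image
    row.roll_ptr = len(undo_log) - 1    # point at the record just written
    row.value, row.trx_id = new_value, trx_id

undo_log = []
row = Row("v1", trx_id=10)
update(row, "v2", 20, undo_log)
update(row, "v3", 30, undo_log)

# Walk the chain from newest to oldest, as an MVCC read would:
versions, ptr = [(row.value, row.trx_id)], row.roll_ptr
while ptr is not None:
    value, trx_id, ptr = undo_log[ptr]
    versions.append((value, trx_id))
print(versions)  # newest first: v3, v2, v1
```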

ReadView

When a transaction performs a snapshot read (SELECT without FOR UPDATE), InnoDB creates a ReadView containing four values:
  • ids — the set of transaction IDs that were active (uncommitted) when the snapshot was taken
  • creator_trx_id — the ID of the transaction creating the ReadView
  • min_trx_id — the smallest ID in ids
  • max_trx_id — the ID that will be assigned to the next transaction
A row version with DB_TRX_ID = X is visible to the ReadView if:
  • X < min_trx_id — committed before the snapshot
  • X == creator_trx_id — written by the current transaction
  • min_trx_id ≤ X < max_trx_id and X is not in ids — started before the snapshot was taken but had already committed by then
If none of these conditions hold, InnoDB follows DB_ROLL_PTR to the next older version and checks again.
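The visibility test can be written directly from the rules above. A minimal sketch, with invented transaction IDs:

```python
# Sketch of the ReadView visibility test: a version written by
# transaction X is visible if it committed before the snapshot was taken,
# or was written by the reading transaction itself.

def visible(x, creator_trx_id, ids, min_trx_id, max_trx_id):
    if x == creator_trx_id:
        return True            # our own write
    if x < min_trx_id:
        return True            # committed before any active transaction
    if x >= max_trx_id:
        return False           # started after the snapshot was taken
    return x not in ids        # in range: visible only if not active

# Snapshot taken while transactions 5 and 8 were active; next ID is 10;
# the reading transaction is 9.
ids, lo, hi, me = {5, 8}, 5, 10, 9
assert visible(3, me, ids, lo, hi)        # committed long ago
assert visible(9, me, ids, lo, hi)        # our own change
assert not visible(5, me, ids, lo, hi)    # still active at snapshot time
assert visible(6, me, ids, lo, hi)        # committed before the snapshot
assert not visible(12, me, ids, lo, hi)   # started after the snapshot
```

When `visible` returns False, the real engine follows DB_ROLL_PTR and retries on the next older version, as the version-chain walk above shows.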

Isolation levels

MySQL supports four isolation levels; InnoDB's default is REPEATABLE READ.
READ UNCOMMITTED: no MVCC. Reads go directly to the latest row data regardless of commit status, exposing dirty reads (reading uncommitted changes that may later be rolled back).
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

InnoDB log system

InnoDB maintains three distinct log types, each serving a different durability or replication purpose.

Undo log

The undo log records before-images — the original values of rows before a transaction modifies them. InnoDB writes each undo record at the moment the row is modified, not at commit time. Two uses:
  1. Rollback: if a transaction aborts, InnoDB replays the undo records in reverse to restore original values.
  2. MVCC snapshots: concurrent readers use undo records to reconstruct older row versions without blocking writers.
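Use 1 can be sketched by applying a transaction's undo records in reverse. The table contents and helper names are illustrative:

```python
# Sketch of rollback: pop the transaction's undo records newest-first and
# restore each before-image, undoing the changes in reverse order.

table = {1: "alice", 2: "bob"}
undo = []  # (key, old_value) before-images, appended in execution order

def txn_update(key, new_value):
    undo.append((key, table[key]))   # record before-image, then modify
    table[key] = new_value

txn_update(1, "ALICE")
txn_update(2, "BOB")
txn_update(1, "Alice2")

def rollback():
    while undo:
        key, old = undo.pop()        # newest record first
        table[key] = old

rollback()
assert table == {1: "alice", 2: "bob"}  # original state restored
```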

Redo log

The redo log records after-images of changes to InnoDB data pages. It exists to survive crash recovery: if MySQL crashes between when a page is modified in the buffer pool and when that page is flushed to disk, the redo log lets InnoDB replay the modification on restart. InnoDB uses a circular buffer of fixed-size redo log files managed by two pointers:
  • Write Pos — where the next redo record will be written
  • Checkpoint — the position behind which all page changes have already been flushed to the data files; space before it can be reclaimed
When Write Pos catches up to Checkpoint, MySQL stalls new writes, flushes dirty pages to disk, and advances Checkpoint to reclaim space. Redo log flushing is controlled by innodb_flush_log_at_trx_commit:
  • 0 — write to the log buffer only; InnoDB flushes it to disk about once per second. Low safety: up to one second of committed work can be lost on a mysqld crash.
  • 1 (default) — write and fsync the log to disk on every commit. High safety: no committed data loss.
  • 2 — write to the OS page cache on every commit; the OS flushes it. Medium safety: up to one second of loss, but only on an OS crash or power failure.
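The Write Pos / Checkpoint mechanics above can be sketched as a circular buffer with two wrapping pointers. The log size and byte counts are invented:

```python
# Sketch of the circular redo log: Write Pos advances as records are
# appended; Checkpoint advances as dirty pages are flushed. Free space is
# the gap from Write Pos forward (wrapping) to Checkpoint.

LOG_SIZE = 1024

def free_space(write_pos, checkpoint):
    # -1 so the pointers never coincide when the log is full
    return (checkpoint - write_pos - 1) % LOG_SIZE

def append(write_pos, checkpoint, nbytes):
    if free_space(write_pos, checkpoint) < nbytes:
        raise RuntimeError("log full: flush dirty pages, advance checkpoint")
    return (write_pos + nbytes) % LOG_SIZE

wp, cp = 0, 0
wp = append(wp, cp, 512)        # plenty of room
assert wp == 512
cp = 600                        # background flush advanced the checkpoint
wp = append(wp, cp, 80)         # still fits before the checkpoint
assert wp == 592
try:
    append(wp, cp, 500)         # would overrun the checkpoint: stall
except RuntimeError as e:
    print(e)
```

The `RuntimeError` branch corresponds to the stall described above: new writes block until dirty pages are flushed and Checkpoint moves forward.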

Binlog

The binlog is a Server-layer log (not storage-engine-specific) that records every data-modifying statement and DDL executed against the database. Its two primary uses are point-in-time recovery and replication. Binlog flushing is controlled by sync_binlog:
  • 0 — write to the OS cache; the OS decides when to flush
  • 1 (default since MySQL 5.7.7) — fsync on every transaction commit
  • N — fsync once every N transaction commits

Two-phase commit

Redo log and binlog are independent. If a crash occurs after one but before the other flushes, their contents diverge and replication breaks. InnoDB prevents this with two-phase commit (2PC):
  1. Prepare phase: InnoDB writes the redo log and marks it PREPARE.
  2. Commit phase: Server writes binlog, then InnoDB marks the redo log COMMIT.
On recovery, if the redo log is in PREPARE state and a matching binlog entry exists, MySQL completes the commit. If the binlog entry is missing, MySQL rolls back.
Transaction executes
  → undo log written (old values)
  → redo log written to buffer (new values)
  → [PREPARE] redo log flushed to disk
  → binlog written and flushed
  → [COMMIT] redo log marked committed
Setting both innodb_flush_log_at_trx_commit=1 and sync_binlog=1 gives you the strongest durability guarantee but incurs two fsync calls per commit. For write-heavy workloads, evaluate sync_binlog=N as a throughput trade-off.
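The recovery rule can be written as a small decision function covering each crash window in the flow above. State names are invented labels for the redo log's prepare/commit marks:

```python
# Sketch of crash recovery under two-phase commit: for a transaction
# found in PREPARE state, commit if its binlog entry reached disk,
# otherwise roll back.

def recover(redo_state, binlog_has_entry):
    if redo_state == "COMMIT":
        return "commit"                 # fully committed before the crash
    if redo_state == "PREPARE":
        return "commit" if binlog_has_entry else "rollback"
    return "rollback"                   # crashed before the prepare mark

# Crash windows from the commit flow:
assert recover("PREPARE", binlog_has_entry=False) == "rollback"  # crash before binlog flush
assert recover("PREPARE", binlog_has_entry=True) == "commit"     # crash after binlog flush
assert recover("COMMIT", binlog_has_entry=True) == "commit"      # crash after commit mark
```

The middle case is the interesting one: once the binlog entry is durable, replicas may already have applied the change, so the primary must commit too.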

Replication

MySQL replication uses the binlog to propagate changes from a primary (master) to one or more replicas (slaves).

How replication works

  1. The primary records each committed transaction to the binlog.
  2. A replica’s I/O thread connects to the primary, requests binlog events, and writes them to a local relay log.
  3. A replica’s SQL thread reads the relay log and replays each event against the replica’s storage engine.
The relay log decouples network I/O from SQL execution, lets replicas buffer events during slowdowns, and allows cascading replication (a replica acting as primary to further replicas).
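The two-thread design can be sketched with a list standing in for the on-disk relay log. Event names are illustrative:

```python
# Toy model of the replica's two threads: the I/O thread appends binlog
# events to a relay log; the SQL thread drains it at its own pace.

primary_binlog = [f"event-{i}" for i in range(5)]
relay_log = []
replica_state = []

def io_thread():
    # Pulls events from the primary and persists them locally.
    for event in primary_binlog:
        relay_log.append(event)

def sql_thread():
    # Replays whatever has arrived, independent of the network.
    while relay_log:
        replica_state.append(relay_log.pop(0))

io_thread()        # network transfer can complete even if apply is slow
sql_thread()       # apply catches up from the relay log afterwards
assert replica_state == primary_binlog
```

Because the relay log persists events before they are applied, a slow or restarted SQL thread never forces the primary to resend anything.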

Replication models

Asynchronous (the default): the primary commits and returns to the client immediately without waiting for any replica to acknowledge. Highest performance, but if the primary crashes before a replica receives an event, that data is lost.
Semi-synchronous: the primary waits for at least one replica to acknowledge that it has received (not necessarily applied) the event before returning to the client, trading some commit latency for bounded data loss.
Fully synchronous: the primary waits for all replicas to commit; rarely used, because one slow replica stalls every write.

Replication lag

Replication lag is the time difference between when the primary commits a transaction and when the replica applies it. Common causes include:
  • Single-threaded SQL thread: older MySQL versions replay binlog events sequentially; a slow query on the replica stalls all subsequent events.
  • Write spikes: a bulk insert on the primary generates a large burst of binlog events that the replica’s SQL thread must catch up on.
  • Network bandwidth: a slow link between primary and replica limits how quickly relay log entries arrive.
Mitigations include multi-threaded replication (slave_parallel_workers; renamed replica_parallel_workers in MySQL 8.0.26), row-based binlog format for large batch operations, and dedicated network paths between primary and replica.

Binlog formats

  • STATEMENT — records the SQL statement text. Human-readable and compact, but non-deterministic functions such as NOW() or UUID() can make primary and replica diverge.
  • ROW — records before and after images of each affected row. Deterministic, and required for statements that are unsafe to replicate as text, at the cost of larger logs.
  • MIXED — uses STATEMENT by default and switches to ROW for unsafe statements. A balanced default for most workloads.
Use ROW format if you subscribe to the binlog for change-data-capture (CDC) pipelines (e.g., Canal, Debezium). Statement-based events are harder to parse reliably and may not carry enough context for downstream consumers.
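The divergence risk is easy to demonstrate. A minimal sketch, with a counter standing in for the clock (all names are illustrative): replaying the SQL text re-evaluates NOW() on the replica, while a row image ships the value itself.

```python
# Why STATEMENT format can diverge: replaying "INSERT ... NOW()"
# re-evaluates NOW() on the replica, while ROW format ships the value.

clock = iter(range(100, 200))  # each call to "NOW()" returns the next tick

def now():
    return next(clock)

# Primary executes the statement once:
primary_value = now()                      # 100

# STATEMENT replication replays the SQL text, so NOW() runs again:
statement_replica_value = now()            # 101: diverged!

# ROW replication ships the after-image, so the value is copied verbatim:
row_replica_value = primary_value          # 100: identical

assert statement_replica_value != primary_value
assert row_replica_value == primary_value
```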