MySQL powers a large share of production backends, and understanding its internals lets you write faster queries, design better schemas, and operate your databases with confidence. This page covers the four areas that matter most in practice: how indexes work at the storage level, how InnoDB implements transaction isolation without locking reads, how the three-log system guarantees durability, and how replication propagates changes across nodes.

Index internals

B+ tree structure, clustered vs. secondary indexes, covering indexes, and the leftmost prefix rule.

Transactions and MVCC

ACID guarantees, isolation levels, undo log version chains, and ReadView mechanics.

InnoDB log system

Undo log, redo log, and binlog — what each one does, when it flushes, and how two-phase commit ties them together.

Replication

Master-slave architecture, binlog formats, sync vs. async vs. semi-sync models, and relay log design.

Index internals

MySQL falls back to a full table scan when no index is available, incurring disk I/O for every page in the table regardless of how few rows match your predicate. InnoDB reads one 16 KB page per I/O to exploit locality, and a well-designed index reduces the number of I/O operations to roughly the height of the tree (logarithmic in the row count).

Why B+ tree

The engine considered several data structures before settling on B+ tree:
  • Hash table — equality lookups only; no range queries; hash collisions grow chains
  • Sorted array — fast reads, but catastrophically slow inserts and deletes
  • Binary / AVL / red-black tree — depth grows with row count; nodes scattered across memory, violating locality
  • B-tree — much shallower, but internal nodes carry row data, so range scans still require multiple I/Os
  • B+ tree — non-leaf nodes store only keys; leaves form a doubly-linked sorted list, so a range scan costs one I/O for the start plus pointer traversal for the rest
B+ tree properties that matter for MySQL:
  • Non-leaf nodes hold only index keys and child pointers, so more keys fit in each 16 KB page, reducing tree height.
  • Leaf nodes form a doubly-linked sorted list, supporting both ascending and descending range scans with a single seek.
  • The root is typically cached in the buffer pool, so most queries touch disk only for the leaf level.
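The height advantage is easy to see with back-of-the-envelope arithmetic. A minimal sketch, assuming an 8-byte BIGINT key, a 6-byte child page pointer, and ~100-byte rows (real pages also carry headers and slot directories, so treat these as order-of-magnitude estimates):

```python
# Rough fan-out arithmetic for a B+ tree on 16 KB InnoDB pages.
# Assumed sizes (illustrative): 8-byte key + 6-byte child pointer per
# non-leaf entry, ~100 bytes per row in a clustered-index leaf.
PAGE_SIZE = 16 * 1024
ENTRY_SIZE = 8 + 6          # key + child pointer in a non-leaf page
ROW_SIZE = 100              # full row stored in a clustered-index leaf

fanout = PAGE_SIZE // ENTRY_SIZE        # keys per internal page (~1170)
rows_per_leaf = PAGE_SIZE // ROW_SIZE   # rows per leaf page (~163)

# Rows addressable by trees of height 2 and 3 (root ... leaves):
rows_h2 = fanout * rows_per_leaf
rows_h3 = fanout * fanout * rows_per_leaf

print(f"fanout={fanout}, 2-level={rows_h2:,}, 3-level={rows_h3:,}")
```

A three-level tree under these assumptions addresses over 200 million rows, which is why point lookups on large tables typically cost only two or three page reads.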

Clustered vs. secondary indexes

InnoDB uses a clustered index (also called the primary key index) where the leaf nodes contain the complete row data. Every table has exactly one clustered index; if you do not define a primary key, InnoDB uses the first unique NOT NULL index instead, and failing that silently generates a hidden 6-byte DB_ROW_ID column as the key. A secondary index (non-clustered index) stores the secondary key value alongside the primary key value in its leaf nodes — not the row data itself. When you query through a secondary index, InnoDB traverses that B+ tree to find the primary key, then performs a table lookup on the clustered index tree to fetch the full row.
-- This query uses the secondary index on (age, sex), then does a table lookup
SELECT name, email FROM users WHERE age = 30 AND sex = 'M';

-- This query uses a covering index — id is already in the leaf of the secondary index
SELECT id FROM users WHERE age = 30 AND sex = 'M';
MyISAM uses a non-clustered model where leaf nodes store the physical disk address of each row rather than the primary key. Both primary and secondary indexes require an address-based lookup, but because the address is direct, MyISAM lookups can be faster than InnoDB’s two-step clustered lookup in read-heavy workloads.
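The two-step lookup can be sketched with plain dictionaries standing in for B+ trees. The table and column names mirror the queries above; everything else is invented for illustration:

```python
# Toy model of InnoDB's two-step lookup: a secondary index on (age, sex)
# maps to primary keys; the clustered index maps primary keys to rows.

clustered = {  # primary key -> full row (clustered index leaf)
    1: {"id": 1, "name": "Ann", "email": "ann@example.com", "age": 30, "sex": "M"},
    2: {"id": 2, "name": "Bob", "email": "bob@example.com", "age": 30, "sex": "M"},
    3: {"id": 3, "name": "Eve", "email": "eve@example.com", "age": 25, "sex": "F"},
}
secondary = {  # (age, sex) -> primary keys (secondary index leaf)
    (30, "M"): [1, 2],
    (25, "F"): [3],
}

def query_name_email(age, sex):
    """SELECT name, email ...: index seek, then one table lookup per match."""
    pks = secondary.get((age, sex), [])
    return [(clustered[pk]["name"], clustered[pk]["email"]) for pk in pks]

def query_id(age, sex):
    """SELECT id ...: covering — the primary key already lives in the leaf."""
    return secondary.get((age, sex), [])

print(query_name_email(30, "M"))  # needs table lookups
print(query_id(30, "M"))          # no table lookup at all
```

The second function never touches `clustered`, which is exactly what makes a covering index cheap.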

Composite indexes and the leftmost prefix rule

A composite index on (a, b, c) sorts rows first by a, then by b within each a group, then by c. The optimizer can use the index for any prefix of that sort order:
  • WHERE a = ? — uses index
  • WHERE a = ? AND b = ? — uses index
  • WHERE a = ? AND b = ? AND c = ? — uses index
  • WHERE a = ? AND c = ? — uses index for a only; c is evaluated as a filter on each row matching a, not via the index seek
  • WHERE b = ? — cannot use index at all
Range conditions (>, <, BETWEEN, LIKE 'prefix%') consume a prefix position, so any column to the right of a range predicate cannot be resolved by the index.
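The rule above can be sketched as a small function that, given the index column order and the predicate type per column, returns how many index columns the seek can use. Names and the predicate encoding are invented for illustration:

```python
# Sketch of the leftmost-prefix rule for a composite index.
# predicates maps column -> "eq" or "range"; absent means no predicate.

def usable_prefix(index_cols, predicates):
    used = 0
    for col in index_cols:
        kind = predicates.get(col)
        if kind is None:
            break              # gap in the prefix: stop
        used += 1
        if kind == "range":
            break              # a range consumes the position; nothing after it
    return used

idx = ["a", "b", "c"]
assert usable_prefix(idx, {"a": "eq", "b": "eq", "c": "eq"}) == 3
assert usable_prefix(idx, {"a": "eq", "c": "eq"}) == 1      # b missing
assert usable_prefix(idx, {"b": "eq"}) == 0                 # no leftmost column
assert usable_prefix(idx, {"a": "eq", "b": "range", "c": "eq"}) == 2
```

This ignores refinements like index condition pushdown, which can still filter trailing columns inside the storage engine, but it captures what the seek itself can use.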

Covering index

When every column in SELECT and WHERE is present in the index, InnoDB can return results from the index leaf alone — no table lookup required. This is a covering index and is one of the highest-impact query optimizations available.
-- Covering: id, age, sex all live in the (age, sex) secondary index leaf
SELECT id FROM users WHERE age = 30;

-- Not covering: name is not in the index; requires table lookup
SELECT name FROM users WHERE age = 30;

Transaction isolation and MVCC

InnoDB implements READ COMMITTED and REPEATABLE READ without locking reads by using Multi-Version Concurrency Control (MVCC). Read operations never block write operations and vice versa, which reduces contention and eliminates a large class of deadlocks.

ACID refresher

  • Atomicity — all operations succeed or all roll back (mechanism: undo log)
  • Consistency — the database moves from one valid state to another (mechanism: constraints + atomicity)
  • Isolation — concurrent transactions see a consistent snapshot (mechanism: MVCC + locks)
  • Durability — committed data survives crashes (mechanism: redo log)

Hidden columns

Every InnoDB row carries three hidden columns that MVCC depends on:
  • DB_TRX_ID — the transaction ID of the last transaction that modified this row
  • DB_ROLL_PTR — a pointer to the previous version of this row in the undo log
  • DB_ROW_ID — an auto-increment hidden primary key (present only when no explicit primary key is defined)

Undo log and the version chain

Every time a transaction modifies a row, InnoDB writes the old column values to the undo log and sets DB_ROLL_PTR on the current row to point back to that undo record. Repeated modifications chain together:
current row  →  undo record (v3)  →  undo record (v2)  →  undo record (v1, original)
The head of the chain is always the latest value; the tail is the oldest committed version. MVCC reads walk this chain until they find a version that is visible under the current isolation level.
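The chain construction can be sketched in a few lines, with a list standing in for the undo log and an index into it standing in for DB_ROLL_PTR. All names and values are illustrative:

```python
# Toy version chain: each update stores the previous version as an undo
# record and points the current row back at it, mirroring DB_ROLL_PTR.

class Row:
    def __init__(self, value, trx_id):
        self.value, self.trx_id, self.roll_ptr = value, trx_id, None

def update(row, new_value, trx_id, undo_log):
    undo_log.append((row.value, row.trx_id, row.roll_ptr))  # before-image
    row.roll_ptr = len(undo_log) - 1    # point at the record just written
    row.value, row.trx_id = new_value, trx_id

undo_log = []
row = Row("v1", trx_id=10)
update(row, "v2", 20, undo_log)
update(row, "v3", 30, undo_log)

# Walk the chain from newest to oldest, as an MVCC read would:
versions, ptr = [(row.value, row.trx_id)], row.roll_ptr
while ptr is not None:
    value, trx_id, ptr = undo_log[ptr]
    versions.append((value, trx_id))
print(versions)  # newest first: v3, v2, v1
```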

ReadView

When a transaction performs a snapshot read (SELECT without FOR UPDATE), InnoDB creates a ReadView containing four values:
  • ids — the set of transaction IDs that were active (uncommitted) when the snapshot was taken
  • creator_trx_id — the ID of the transaction creating the ReadView
  • min_trx_id — the smallest ID in ids
  • max_trx_id — the ID that will be assigned to the next transaction
A row version with DB_TRX_ID = X is visible to the ReadView if:
  • X < min_trx_id — committed before the snapshot
  • X == creator_trx_id — written by the current transaction
  • min_trx_id ≤ X < max_trx_id and X is not in ids — started before the snapshot was taken but had already committed by then
If none of these conditions hold, InnoDB follows DB_ROLL_PTR to the next older version and checks again.
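The visibility test can be written directly from the rules above. A minimal sketch, with invented transaction IDs:

```python
# Sketch of the ReadView visibility test: a version written by
# transaction X is visible if it committed before the snapshot was taken,
# or was written by the reading transaction itself.

def visible(x, creator_trx_id, ids, min_trx_id, max_trx_id):
    if x == creator_trx_id:
        return True            # our own write
    if x < min_trx_id:
        return True            # committed before any active transaction
    if x >= max_trx_id:
        return False           # started after the snapshot was taken
    return x not in ids        # in range: visible only if not active

# Snapshot taken while transactions 5 and 8 were active; next ID is 10;
# the reading transaction is 9.
ids, lo, hi, me = {5, 8}, 5, 10, 9
assert visible(3, me, ids, lo, hi)        # committed long ago
assert visible(9, me, ids, lo, hi)        # our own change
assert not visible(5, me, ids, lo, hi)    # still active at snapshot time
assert visible(6, me, ids, lo, hi)        # committed before the snapshot
assert not visible(12, me, ids, lo, hi)   # started after the snapshot
```

When `visible` returns False, the real engine follows DB_ROLL_PTR and retries on the next older version, as the version-chain walk above shows.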

Isolation levels

MySQL supports four isolation levels; InnoDB's default is REPEATABLE READ.
READ UNCOMMITTED: no MVCC. Reads go directly to the latest row data regardless of commit status, exposing dirty reads (reading uncommitted changes that may later be rolled back).
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

InnoDB log system

InnoDB maintains three distinct log types, each serving a different durability or replication purpose.

Undo log

The undo log records before-images — the original values of rows before a transaction modifies them. InnoDB writes each undo record at the moment the row is modified, not at commit time. Two uses:
  1. Rollback: if a transaction aborts, InnoDB replays the undo records in reverse to restore original values.
  2. MVCC snapshots: concurrent readers use undo records to reconstruct older row versions without blocking writers.
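Use 1 can be sketched by applying a transaction's undo records in reverse. The table contents and helper names are illustrative:

```python
# Sketch of rollback: pop the transaction's undo records newest-first and
# restore each before-image, undoing the changes in reverse order.

table = {1: "alice", 2: "bob"}
undo = []  # (key, old_value) before-images, appended in execution order

def txn_update(key, new_value):
    undo.append((key, table[key]))   # record before-image, then modify
    table[key] = new_value

txn_update(1, "ALICE")
txn_update(2, "BOB")
txn_update(1, "Alice2")

def rollback():
    while undo:
        key, old = undo.pop()        # newest record first
        table[key] = old

rollback()
assert table == {1: "alice", 2: "bob"}  # original state restored
```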

Redo log

The redo log records after-images of changes to InnoDB data pages. It exists to survive crash recovery: if MySQL crashes between when a page is modified in the buffer pool and when that page is flushed to disk, the redo log lets InnoDB replay the modification on restart. InnoDB uses a circular buffer of fixed-size redo log files managed by two pointers:
  • Write Pos — where the next redo record will be written
  • Checkpoint — the position behind which all page changes have already been flushed to the data files; space before it can be reclaimed
When Write Pos catches up to Checkpoint, MySQL stalls new writes, flushes dirty pages to disk, and advances Checkpoint to reclaim space. Redo log flushing is controlled by innodb_flush_log_at_trx_commit:
  • 0 — write to the log buffer only; InnoDB flushes it to disk about once per second. Low safety: up to one second of committed work can be lost on a mysqld crash.
  • 1 (default) — write and fsync the log to disk on every commit. High safety: no committed data loss.
  • 2 — write to the OS page cache on every commit; the OS flushes it. Medium safety: up to one second of loss, but only on an OS crash or power failure.
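The Write Pos / Checkpoint mechanics above can be sketched as a circular buffer with two wrapping pointers. The log size and byte counts are invented:

```python
# Sketch of the circular redo log: Write Pos advances as records are
# appended; Checkpoint advances as dirty pages are flushed. Free space is
# the gap from Write Pos forward (wrapping) to Checkpoint.

LOG_SIZE = 1024

def free_space(write_pos, checkpoint):
    # -1 so the pointers never coincide when the log is full
    return (checkpoint - write_pos - 1) % LOG_SIZE

def append(write_pos, checkpoint, nbytes):
    if free_space(write_pos, checkpoint) < nbytes:
        raise RuntimeError("log full: flush dirty pages, advance checkpoint")
    return (write_pos + nbytes) % LOG_SIZE

wp, cp = 0, 0
wp = append(wp, cp, 512)        # plenty of room
assert wp == 512
cp = 600                        # background flush advanced the checkpoint
wp = append(wp, cp, 80)         # still fits before the checkpoint
assert wp == 592
try:
    append(wp, cp, 500)         # would overrun the checkpoint: stall
except RuntimeError as e:
    print(e)
```

The `RuntimeError` branch corresponds to the stall described above: new writes block until dirty pages are flushed and Checkpoint moves forward.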

Binlog

The binlog is a Server-layer log (not storage-engine-specific) that records every data-modifying statement and DDL executed against the database. Its two primary uses are point-in-time recovery and replication. Binlog flushing is controlled by sync_binlog:
  • 0 — write to the OS cache; the OS decides when to flush
  • 1 (default since MySQL 5.7.7) — fsync on every transaction commit
  • N — fsync once every N transaction commits

Two-phase commit

Redo log and binlog are independent. If a crash occurs after one but before the other flushes, their contents diverge and replication breaks. InnoDB prevents this with two-phase commit (2PC):
  1. Prepare phase: InnoDB writes the redo log and marks it PREPARE.
  2. Commit phase: Server writes binlog, then InnoDB marks the redo log COMMIT.
On recovery, if the redo log is in PREPARE state and a matching binlog entry exists, MySQL completes the commit. If the binlog entry is missing, MySQL rolls back.
Transaction executes
  → undo log written (old values)
  → redo log written to buffer (new values)
  → [PREPARE] redo log flushed to disk
  → binlog written and flushed
  → [COMMIT] redo log marked committed
Setting both innodb_flush_log_at_trx_commit=1 and sync_binlog=1 gives you the strongest durability guarantee but incurs two fsync calls per commit. For write-heavy workloads, evaluate sync_binlog=N as a throughput trade-off.
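The recovery rule can be written as a small decision function covering each crash window in the flow above. State names are invented labels for the redo log's prepare/commit marks:

```python
# Sketch of crash recovery under two-phase commit: for a transaction
# found in PREPARE state, commit if its binlog entry reached disk,
# otherwise roll back.

def recover(redo_state, binlog_has_entry):
    if redo_state == "COMMIT":
        return "commit"                 # fully committed before the crash
    if redo_state == "PREPARE":
        return "commit" if binlog_has_entry else "rollback"
    return "rollback"                   # crashed before the prepare mark

# Crash windows from the commit flow:
assert recover("PREPARE", binlog_has_entry=False) == "rollback"  # crash before binlog flush
assert recover("PREPARE", binlog_has_entry=True) == "commit"     # crash after binlog flush
assert recover("COMMIT", binlog_has_entry=True) == "commit"      # crash after commit mark
```

The middle case is the interesting one: once the binlog entry is durable, replicas may already have applied the change, so the primary must commit too.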

Replication

MySQL replication uses the binlog to propagate changes from a primary (master) to one or more replicas (slaves).

How replication works

  1. The primary records each committed transaction to the binlog.
  2. A replica’s I/O thread connects to the primary, requests binlog events, and writes them to a local relay log.
  3. A replica’s SQL thread reads the relay log and replays each event against the replica’s storage engine.
The relay log decouples network I/O from SQL execution, lets replicas buffer events during slowdowns, and allows cascading replication (a replica acting as primary to further replicas).
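The two-thread design can be sketched with a list standing in for the on-disk relay log. Event names are illustrative:

```python
# Toy model of the replica's two threads: the I/O thread appends binlog
# events to a relay log; the SQL thread drains it at its own pace.

primary_binlog = [f"event-{i}" for i in range(5)]
relay_log = []
replica_state = []

def io_thread():
    # Pulls events from the primary and persists them locally.
    for event in primary_binlog:
        relay_log.append(event)

def sql_thread():
    # Replays whatever has arrived, independent of the network.
    while relay_log:
        replica_state.append(relay_log.pop(0))

io_thread()        # network transfer can complete even if apply is slow
sql_thread()       # apply catches up from the relay log afterwards
assert replica_state == primary_binlog
```

Because the relay log persists events before they are applied, a slow or restarted SQL thread never forces the primary to resend anything.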

Replication models

Asynchronous (the default): the primary commits and returns to the client immediately without waiting for any replica to acknowledge. Highest performance, but if the primary crashes before a replica receives an event, that data is lost.
Semi-synchronous: the primary waits for at least one replica to acknowledge that it has received (not necessarily applied) the event before returning to the client, trading some commit latency for bounded data loss.
Fully synchronous: the primary waits for all replicas to commit; rarely used, because one slow replica stalls every write.

Replication lag

Replication lag is the time difference between when the primary commits a transaction and when the replica applies it. Common causes include:
  • Single-threaded SQL thread: older MySQL versions replay binlog events sequentially; a slow query on the replica stalls all subsequent events.
  • Write spikes: a bulk insert on the primary generates a large burst of binlog events that the replica’s SQL thread must catch up on.
  • Network bandwidth: a slow link between primary and replica limits how quickly relay log entries arrive.
Mitigations include multi-threaded replication (slave_parallel_workers; renamed replica_parallel_workers in MySQL 8.0.26), row-based binlog format for large batch operations, and dedicated network paths between primary and replica.

Binlog formats

  • STATEMENT — records the SQL statement text. Human-readable and compact, but non-deterministic functions such as NOW() or UUID() can make primary and replica diverge.
  • ROW — records before and after images of each affected row. Deterministic, and required for statements that are unsafe to replicate as text, at the cost of larger logs.
  • MIXED — uses STATEMENT by default and switches to ROW for unsafe statements. A balanced default for most workloads.
Use ROW format if you subscribe to the binlog for change-data-capture (CDC) pipelines (e.g., Canal, Debezium). Statement-based events are harder to parse reliably and may not carry enough context for downstream consumers.
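The divergence risk is easy to demonstrate. A minimal sketch, with a counter standing in for the clock (all names are illustrative): replaying the SQL text re-evaluates NOW() on the replica, while a row image ships the value itself.

```python
# Why STATEMENT format can diverge: replaying "INSERT ... NOW()"
# re-evaluates NOW() on the replica, while ROW format ships the value.

clock = iter(range(100, 200))  # each call to "NOW()" returns the next tick

def now():
    return next(clock)

# Primary executes the statement once:
primary_value = now()                      # 100

# STATEMENT replication replays the SQL text, so NOW() runs again:
statement_replica_value = now()            # 101: diverged!

# ROW replication ships the after-image, so the value is copied verbatim:
row_replica_value = primary_value          # 100: identical

assert statement_replica_value != primary_value
assert row_replica_value == primary_value
```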