> ## Documentation Index > Fetch the complete documentation index at: https://docs.pebchip.top/docs/llms.txt > Use this file to discover all available pages before exploring further. # MySQL Deep Dive: Indexes, Transactions, and Replication > Comprehensive MySQL reference covering B+ tree indexes, MVCC transactions, InnoDB logging, replication, and query optimization techniques. MySQL powers the majority of production backends, and understanding its internals lets you write faster queries, design better schemas, and operate your databases with confidence. This page covers the four areas that matter most in practice: how indexes work at the storage level, how InnoDB implements transaction isolation without locks, how the three-log system guarantees durability, and how replication propagates changes across nodes. B+ tree structure, clustered vs. secondary indexes, covering indexes, and the leftmost prefix rule. ACID guarantees, isolation levels, undo log version chains, and ReadView mechanics. Undo log, redo log, and binlog — what each one does, when it flushes, and how two-phase commit ties them together. Master-slave architecture, binlog formats, sync vs. async vs. semi-sync models, and relay log design. ## Index internals MySQL performs a full table scan when no index is available, which causes repeated disk I/O for every row that does not match your predicate. InnoDB reads 16 KB per I/O by exploiting locality — but a well-designed index reduces the number of I/O operations to a logarithmic minimum. ### Why B+ tree The engine considered several data structures before settling on B+ tree: | Structure | Problem | | ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------- | | Hash table | Equality-only; no range queries; hash collisions grow chains | | Sorted array | Fast reads; catastrophically slow inserts and deletes | | Binary / AVL / Red-black tree | Depth grows with row count; data scattered across memory, violating locality | | B-tree | Better than trees, but internal nodes carry data — range scans still require multiple I/Os | | **B+ tree** | Non-leaf nodes store only keys; leaves form a doubly-linked sorted list — one I/O for range start, pointer traversal for the rest | B+ tree properties that matter for MySQL: * Non-leaf nodes hold **only index keys and child pointers**, so more keys fit in each 16 KB page, reducing tree height. * Leaf nodes form a **doubly-linked sorted list**, supporting both ascending and descending range scans with a single seek. * The root is typically cached in the buffer pool, so most queries touch disk only for the leaf level. ### Clustered vs. secondary indexes InnoDB uses a **clustered index** (also called the primary key index) where the leaf nodes contain the complete row data. Every table has exactly one clustered index; if you do not define a primary key, InnoDB silently creates a hidden auto-increment `DB_ROW_ID` column as the key. A **secondary index** (non-clustered index) stores the secondary key value alongside the **primary key value** in its leaf nodes — not the row data itself. When you query through a secondary index, InnoDB traverses that B+ tree to find the primary key, then performs a **table lookup** (回表) on the clustered index tree to fetch the full row. ```sql theme={null} -- This query uses the secondary index on (age, sex), then does a table lookup SELECT name, email FROM users WHERE age = 30 AND sex = 'M'; -- This query uses a covering index — id is already in the leaf of the secondary index SELECT id FROM users WHERE age = 30 AND sex = 'M'; ``` MyISAM uses a non-clustered model where leaf nodes store the physical disk address of each row rather than the primary key. Both primary and secondary indexes require an address-based lookup, but because the address is direct, MyISAM lookups can be faster than InnoDB's two-step clustered lookup in read-heavy workloads. ### Composite indexes and the leftmost prefix rule A composite index on `(a, b, c)` sorts rows first by `a`, then by `b` within each `a` group, then by `c`. The optimizer can use the index for any prefix of that sort order: * `WHERE a = ?` — uses index * `WHERE a = ? AND b = ?` — uses index * `WHERE a = ? AND b = ? AND c = ?` — uses index * `WHERE a = ? AND c = ?` — uses index for `a` only; `c` requires a table scan within matching rows * `WHERE b = ?` — cannot use index at all Range conditions (`>`, `<`, `BETWEEN`, `LIKE 'prefix%'`) consume a prefix position, so any column to the right of a range predicate cannot be resolved by the index. ### Covering index When every column in `SELECT` and `WHERE` is present in the index, InnoDB can return results from the index leaf alone — no table lookup required. This is a **covering index** and is one of the highest-impact query optimizations available. ```sql theme={null} -- Covering: id, age, sex all live in the (age, sex) secondary index leaf SELECT id FROM users WHERE age = 30; -- Not covering: name is not in the index; requires table lookup SELECT name FROM users WHERE age = 30; ``` *** ## Transaction isolation and MVCC InnoDB implements `READ COMMITTED` and `REPEATABLE READ` without locking reads by using **Multi-Version Concurrency Control (MVCC)**. Read operations never block write operations and vice versa, which reduces contention and eliminates a large class of deadlocks. ### ACID refresher | Property | Guarantee | InnoDB mechanism | | ----------- | ------------------------------------------------- | ----------------------- | | Atomicity | All operations succeed or all roll back | Undo log | | Consistency | Database moves from one valid state to another | Constraints + atomicity | | Isolation | Concurrent transactions see a consistent snapshot | MVCC + locks | | Durability | Committed data survives crashes | Redo log | ### Hidden columns Every InnoDB row carries three hidden columns that MVCC depends on: * `DB_TRX_ID` — the transaction ID of the last transaction that modified this row * `DB_ROLL_PTR` — a pointer to the previous version of this row in the undo log * `DB_ROW_ID` — an auto-increment hidden primary key (present only when no explicit primary key is defined) ### Undo log and the version chain Every time a transaction modifies a row, InnoDB writes the old column values to the **undo log** and sets `DB_ROLL_PTR` on the current row to point back to that undo record. Repeated modifications chain together: ``` current row → undo record (v3) → undo record (v2) → undo record (v1, original) ``` The head of the chain is always the latest value; the tail is the oldest committed version. MVCC reads walk this chain until they find a version that is visible under the current isolation level. ### ReadView When a transaction performs a **snapshot read** (`SELECT` without `FOR UPDATE`), InnoDB creates a **ReadView** containing four values: | Field | Meaning | | ---------------- | ----------------------------------------------------------------- | | `ids` | List of all active (uncommitted) transaction IDs at snapshot time | | `creator_trx_id` | Transaction ID of the current transaction | | `min_trx_id` | Smallest ID in `ids` | | `max_trx_id` | Next transaction ID that has not yet started | A row version with `DB_TRX_ID = X` is **visible** to the ReadView if: * `X < min_trx_id` — committed before the snapshot * `X == creator_trx_id` — written by the current transaction * `X` is not in `ids` and `X < max_trx_id` — committed after snapshot creation but before any active transaction If none of these conditions hold, InnoDB follows `DB_ROLL_PTR` to the next older version and checks again. ### Isolation levels No MVCC. Reads go directly to the latest row data regardless of commit status. Exposes **dirty reads** (reading uncommitted changes that may be rolled back). ```sql theme={null} SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED; ``` A new ReadView is created for **each statement** within the transaction. You always read the latest committed data, but two identical `SELECT` statements within the same transaction may return different rows if another transaction commits in between (**non-repeatable read**). ```sql theme={null} SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED; ``` A single ReadView is created at the **start of the first snapshot read** and reused for the entire transaction. The same `SELECT` always returns the same rows regardless of concurrent commits. InnoDB also uses gap locks to prevent **phantom reads** in most cases. ```sql theme={null} SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ; ``` All reads are implicitly converted to `SELECT ... FOR SHARE`. Transactions execute in strict serial order. Eliminates all anomalies but dramatically reduces concurrency. ```sql theme={null} SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE; ``` *** ## InnoDB log system InnoDB maintains three distinct log types, each serving a different durability or replication purpose. ### Undo log The undo log records **before-images** — the original values of rows before a transaction modifies them. InnoDB writes undo records when a transaction opens, not when it commits. Two uses: 1. **Rollback**: if a transaction aborts, InnoDB replays the undo records in reverse to restore original values. 2. **MVCC snapshots**: concurrent readers use undo records to reconstruct older row versions without blocking writers. ### Redo log The redo log records **after-images** of changes to InnoDB data pages. It exists to survive crash recovery: if MySQL crashes between when a page is modified in the buffer pool and when that page is flushed to disk, the redo log lets InnoDB replay the modification on restart. InnoDB uses a circular buffer of fixed-size redo log files managed by two pointers: * **Write Pos** — where the next redo record will be written * **Checkpoint** — the oldest record not yet flushed to data files When `Write Pos` catches up to `Checkpoint`, MySQL stalls new writes, flushes dirty pages to disk, and advances `Checkpoint` to reclaim space. Redo log flushing is controlled by `innodb_flush_log_at_trx_commit`: | Value | Behavior | Safety | | ------------- | -------------------------------------------------- | ---------------------------------------- | | `0` | Buffer only; OS flushes when it wants | Low — up to 1 second of data loss | | `1` (default) | `fsync` to disk on every commit | High — no data loss | | `2` | Write to OS page cache on every commit; OS flushes | Medium — up to 1 second loss on OS crash | ### Binlog The binlog is a **Server-layer** log (not storage-engine-specific) that records every data-modifying statement and DDL executed against the database. Its two primary uses are **point-in-time recovery** and **replication**. Binlog flushing is controlled by `sync_binlog`: | Value | Behavior | | ------------- | ------------------------------------------- | | `0` (default) | Write to OS cache; OS decides when to flush | | `1` | `fsync` on every transaction commit | | `N` | `fsync` every N transactions | ### Two-phase commit Redo log and binlog are independent. If a crash occurs after one but before the other flushes, their contents diverge and replication breaks. InnoDB prevents this with **two-phase commit (2PC)**: 1. **Prepare phase**: InnoDB writes the redo log and marks it `PREPARE`. 2. **Commit phase**: Server writes binlog, then InnoDB marks the redo log `COMMIT`. On recovery, if the redo log is in `PREPARE` state and a matching binlog entry exists, MySQL completes the commit. If the binlog entry is missing, MySQL rolls back. ``` Transaction executes → undo log written (old values) → redo log written to buffer (new values) → [PREPARE] redo log flushed to disk → binlog written and flushed → [COMMIT] redo log marked committed ``` Setting both `innodb_flush_log_at_trx_commit=1` and `sync_binlog=1` gives you the strongest durability guarantee but incurs two `fsync` calls per commit. For write-heavy workloads, evaluate `sync_binlog=N` as a throughput trade-off. *** ## Replication MySQL replication uses the binlog to propagate changes from a **primary** (master) to one or more **replicas** (slaves). ### How replication works 1. The primary records each committed transaction to the binlog. 2. A replica's **I/O thread** connects to the primary, requests binlog events, and writes them to a local **relay log**. 3. A replica's **SQL thread** reads the relay log and replays each event against the replica's storage engine. The relay log decouples network I/O from SQL execution, lets replicas buffer events during slowdowns, and allows cascading replication (a replica acting as primary to further replicas). ### Replication models The primary commits and returns to the client immediately without waiting for any replica to acknowledge. Highest performance, but if the primary crashes before a replica receives an event, that data is lost. Introduced in MySQL 5.7. The primary waits for **at least one replica** to write the event to its relay log before returning to the client. At least one replica always has the latest committed data, eliminating the data-loss scenario of pure async replication with minimal latency impact. The primary waits for **all replicas** to acknowledge before returning. No data loss under any failure scenario, but latency equals the slowest replica. Rarely used in practice. ### Replication lag Replication lag is the time difference between when the primary commits a transaction and when the replica applies it. Common causes include: * **Single-threaded SQL thread**: older MySQL versions replay binlog events sequentially; a slow query on the replica stalls all subsequent events. * **Write spikes**: a bulk insert on the primary generates a large burst of binlog events that the replica's SQL thread must catch up on. * **Network bandwidth**: a slow link between primary and replica limits how quickly relay log entries arrive. Mitigations include multi-threaded replication (`slave_parallel_workers`), row-based binlog format for large batch operations, and dedicated network paths between primary and replica. ### Binlog formats | Format | How it records | Use cases | | ----------- | -------------------------------------------------------------------------- | -------------------------------------------------------------------------------------- | | `STATEMENT` | SQL statement text | Human-readable; small log size; non-deterministic functions can cause divergence | | `ROW` | Before and after images of each affected row | Deterministic; larger log size; required for some DDL and non-deterministic statements | | `MIXED` | `STATEMENT` by default; switches to `ROW` for non-deterministic statements | Balanced default for most workloads | Use `ROW` format if you subscribe to the binlog for change-data-capture (CDC) pipelines (e.g., Canal, Debezium). Statement-based events are harder to parse reliably and may not carry enough context for downstream consumers.