Distributed systems split computation and data across multiple machines to achieve scale, fault tolerance, or geographic distribution. That distribution introduces a fundamental set of problems that do not exist on a single node: nodes can crash independently, network links can drop or partition, clocks drift, and messages can arrive out of order or not at all. This page covers the core theoretical constraints—CAP theorem and Raft consensus—and the practical engineering patterns—distributed locks, idempotency, data models, and reliability primitives—that you need to build correct, resilient distributed services.
CAP Theorem
CAP theorem states that a distributed data system can provide at most two of the following three guarantees simultaneously:
| Property | Meaning |
|---|---|
| Consistency (C) | Every read receives the most recent write, or an error. All nodes see the same data at the same time. |
| Availability (A) | Every request receives a non-error response, though it may not contain the most recent data. |
| Partition Tolerance (P) | The system continues operating even when network partitions drop or delay messages between nodes. |
Why you can only have two: Consider a primary node that must sync writes to replicas before responding. To guarantee consistency, it must wait for all replicas to acknowledge the write—which can be very slow or impossible during a partition. To guarantee availability, it can respond as soon as one replica confirms—but then a reader hitting an unsynchronized replica sees stale data. You cannot have both strong guarantees at once across an unreliable network.
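The trade-off can be made concrete with a toy two-node model (an illustration only, not a real replication protocol): in "CP mode" the primary refuses writes it cannot replicate; in "AP mode" it accepts them, and a reader hitting the replica sees stale data.

```python
# Toy primary/replica pair with a flag simulating a network partition.
class Replica:
    def __init__(self):
        self.value = None

class Primary:
    def __init__(self, replica):
        self.replica = replica
        self.value = None
        self.partitioned = False  # True = replication link is down

    def write(self, value, mode):
        if mode == "CP" and self.partitioned:
            # CP choice: refuse the write rather than let nodes diverge.
            raise RuntimeError("unavailable: cannot replicate during partition")
        self.value = value
        if not self.partitioned:
            self.replica.value = value  # best-effort replication

replica = Replica()
primary = Primary(replica)
primary.write("v1", mode="AP")
primary.partitioned = True
primary.write("v2", mode="AP")  # AP stays available and accepts the write...
print(replica.value)            # ...so a reader of the replica sees stale "v1"
```

In CP mode the same partitioned write raises instead of succeeding: availability is sacrificed, consistency preserved.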
For distributed systems, P is non-negotiable. Network partitions happen in any real multi-node deployment. This means every distributed system is really choosing between CP and AP:
- CP systems prioritize data correctness and will refuse or delay responses during a partition to avoid returning inconsistent data. Examples: etcd, ZooKeeper, distributed SQL databases (CockroachDB, TiDB). These are used for coordination, configuration, and leader election where stale data is dangerous.
- AP systems prioritize uptime and respond quickly even if the data might be slightly stale. Examples: Redis (with replication), Cassandra, DynamoDB. These are used for caches, shopping carts, and feeds where a brief inconsistency is acceptable.
Choosing CP or AP does not mean completely abandoning the other property. In practice, you downgrade the non-chosen property rather than eliminate it. A CP system still aims for five-nines availability; an AP system uses techniques like read-repair and anti-entropy to minimize inconsistency windows.
Raft Consensus Algorithm
Raft is a consensus algorithm designed to be more understandable than Paxos while providing equivalent guarantees. It is used in production systems including etcd (the backing store for Kubernetes), CockroachDB, and TiKV.
What Raft Solves
Raft solves the replicated log problem: how do you keep a log of commands identical across N servers even when some servers crash or messages are delayed? A consistent replicated log lets you build any fault-tolerant state machine—configuration stores, coordination services, or distributed databases.
Node Roles
At any moment, every Raft node is in exactly one of three states:
| Role | Responsibility |
|---|---|
| Leader | There is exactly one leader per term. The leader accepts client writes, appends them to its log, and replicates them to followers. |
| Follower | Passively receives log entries from the leader and votes in elections. |
| Candidate | A follower that has timed out waiting for the leader and is campaigning for election. |
Leader Election
Raft uses a heartbeat mechanism to trigger elections. When a follower receives no heartbeat within a randomized election timeout (typically 150–300 ms), it increments its term number, converts to candidate, and requests votes from all other nodes.
A candidate wins if it receives votes from a majority of nodes. Each node votes at most once per term, using first-come-first-served. The winner immediately sends heartbeats to all followers, establishing its authority for the new term.
Handling split votes: If no candidate gets a majority (a “split vote”), every candidate times out and starts a new election with an incremented term. Raft uses randomized timeouts so that one candidate usually starts the next election before the others, breaking the tie quickly. This approach proved more reliable in practice than priority-based schemes. (A split vote is not a split brain: until someone wins, there is simply no leader, and there is never more than one leader per term.)
Election safety constraint: A candidate only receives a vote if its log is at least as up-to-date as the voter’s log (compared first by the term of the last log entry, then by log length). This prevents a node with stale data from becoming leader and overwriting committed entries.
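The voting rules above fit in a few lines. This is a simplified sketch (state kept in a plain dict, names are illustrative), combining the one-vote-per-term rule with the up-to-date check:

```python
def log_up_to_date(cand_last_term, cand_last_index, my_last_term, my_last_index):
    """Raft's election restriction: compare last entries' terms first,
    then log length, to decide if the candidate's log is current enough."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index

def handle_request_vote(state, term, candidate_id, cand_last_term, cand_last_index):
    """state: dict with current_term, voted_for, last_term, last_index."""
    if term < state["current_term"]:
        return False                      # stale candidate from an old term
    if term > state["current_term"]:
        state["current_term"] = term      # newer term: forget our old vote
        state["voted_for"] = None
    if state["voted_for"] not in (None, candidate_id):
        return False                      # at most one vote per term, FCFS
    if not log_up_to_date(cand_last_term, cand_last_index,
                          state["last_term"], state["last_index"]):
        return False                      # candidate's log is behind ours
    state["voted_for"] = candidate_id
    return True
```

Note that a candidate with a higher last-entry term wins the comparison even if its log is shorter; length only breaks ties between equal terms.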
Log Replication
Once elected, the leader:
- Accepts a client request and appends it as a new entry to its own log.
- Sends AppendEntries RPCs in parallel to all followers.
- Waits for a majority of followers to acknowledge the entry.
- Marks the entry as committed and applies it to the state machine.
- Notifies followers of the commit in the next heartbeat so they apply the entry too.
If a follower’s log diverges from the leader’s (due to a previous leader crashing mid-replication), the leader forces the follower to overwrite conflicting entries by finding the last point of agreement and resending from there. The leader’s log always wins; leaders never rewrite their own entries.
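The follower’s side of this repair can be sketched as follows (a simplified model, not a full implementation: the log is a list of `(term, command)` tuples and `prev_index` is 1-based):

```python
def append_entries(log, prev_index, prev_term, entries):
    """Follower-side consistency check and log repair, per the Raft rules:
    reject if we disagree at prev_index; otherwise truncate at the first
    conflicting entry and append the leader's entries."""
    # Step 1: reject unless our log contains the entry the leader points at.
    if prev_index > 0 and (len(log) < prev_index or log[prev_index - 1][0] != prev_term):
        return False   # leader will decrement prev_index and resend from earlier
    # Step 2: walk the new entries; delete our suffix at the first term conflict.
    i = prev_index
    for entry in entries:
        if i < len(log) and log[i][0] != entry[0]:
            del log[i:]            # conflict: the leader's log wins from here on
        if i >= len(log):
            log.append(entry)      # matching entries are left untouched
        i += 1
    return True
```

A rejected call tells the leader the logs disagree at `prev_index`, so it backs up and retries from an earlier point of agreement.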
Why Raft over Paxos?
The authors of Raft’s original paper (In Search of an Understandable Consensus Algorithm) designed it explicitly for understandability. Paxos is notoriously difficult to implement correctly because the paper describes only single-decree consensus (agreement on one value), leaving the multi-decree extension (a replicated log) largely as an exercise. Raft decomposes the problem into leader election, log replication, and safety, each with clear invariants, making it far easier to implement correctly and reason about.
Distributed Locks
In a multi-process or multi-machine environment, local OS mutexes protect shared state only within a single process. When multiple service instances share a resource—a database row, a file, a rate-limit counter—you need a lock backed by an external system all instances can reach.
Implementation Options
| Approach | Mechanism | Pros | Cons |
|---|---|---|---|
| Database row lock | Insert/update a “lock” row; use DB transactions for atomicity | Simple, uses existing infrastructure | Poor performance under high contention; no auto-expiry without extra columns |
| Redis SETNX | `SET lock_key uuid NX PX ttl` | Very fast; TTL prevents deadlocks | Risk of lock loss on Redis primary failover; requires Redlock for multi-node safety |
| ZooKeeper ephemeral nodes | Create a sequential ephemeral node; become leader if smallest | Strong consistency; auto-releases when client disconnects | Complex to operate; lower throughput than Redis |
| etcd lease | Acquire a key with a lease; renew via keepalive | Strong consistency (CP); lease auto-expires on crash | Requires etcd cluster; sensitive to network latency |
Redis SETNX: Known Problems and Solutions
A naive Redis lock using SETNX has two failure modes:
- Lock expiry before the holder finishes: A slow thread holds the lock past its TTL; a second thread acquires it; both are now in the critical section simultaneously.
- Wrong owner releases the lock: Thread A’s lock expires, Thread B acquires it, then Thread A completes and deletes Thread B’s lock key.
Watchdog (heartbeat renewal): A background thread periodically checks whether the current process still holds the lock and extends the TTL if so. This prevents expiry on slow-but-alive holders.
Owner identity check: Store the lock value as a UUID unique to the lock acquisition. Before deleting, verify the stored value matches your UUID. Use a Lua script to make the check-and-delete atomic:
```lua
if redis.call("get", KEYS[1]) == ARGV[1] then
  return redis.call("del", KEYS[1])
else
  return 0
end
```
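The owner-check pattern is easiest to see in a self-contained simulation. This sketch stands in for Redis with an in-memory dict guarded by a mutex (playing the role of Redis’s single-threaded command execution); TTL handling is omitted for brevity:

```python
import threading
import uuid

class FakeRedisLock:
    """In-memory stand-in for the SETNX + check-and-delete pattern above."""
    def __init__(self):
        self._store = {}
        self._mutex = threading.Lock()  # simulates Redis's atomic commands

    def acquire(self, key):
        token = str(uuid.uuid4())       # unique owner identity per acquisition
        with self._mutex:               # like SET key token NX
            if key in self._store:
                return None             # already held by someone else
            self._store[key] = token
        return token

    def release(self, key, token):
        with self._mutex:               # atomic check-and-delete, like the Lua script
            if self._store.get(key) == token:
                del self._store[key]
                return True
        return False                    # not our lock anymore; leave it alone

lock = FakeRedisLock()
t1 = lock.acquire("mutex1")
assert lock.acquire("mutex1") is None            # second acquirer is blocked
assert lock.release("mutex1", "wrong") is False  # wrong owner cannot release
assert lock.release("mutex1", t1) is True        # real owner releases cleanly
```

The essential point carries over directly: the get-compare-delete must be one atomic step, which is exactly what the Lua script provides on a real Redis server.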
Redisson (Java): The Redisson library implements all of the above—watchdog renewal, owner-based release, and re-entrancy (tracking a lock holder and reentry count in a Redis hash)—so you don’t have to build it yourself.
etcd Lease-Based Lock
etcd’s lease mechanism makes distributed locking straightforward:
```shell
# Acquire a lock named "mutex1"; blocks if already held
etcdctl --endpoints=$ENDPOINTS lock mutex1
```
Programmatically, you create a lease with a TTL, associate the lock key with that lease, and call KeepAlive to renew it. When your process crashes, the lease expires and the lock is automatically released. etcd’s Raft-based CP consistency means there is never a split-brain scenario where two processes both believe they hold the lock.
Idempotency
In distributed systems, network failures mean that a caller often cannot tell whether a request was received and processed. The safe response is to retry—but retrying a non-idempotent operation (like “charge the customer $50”) can cause duplicate side effects.
Idempotency means that performing an operation multiple times has the same effect as performing it once.
Why It Matters: Delivery Semantics
| Guarantee | Meaning | Risk |
|---|---|---|
| At-most-once | Send once; never retry | Messages may be lost |
| At-least-once | Retry until acknowledged | Duplicates possible |
| Exactly-once | Deliver precisely once | Hardest to achieve; requires idempotency on the receiver |
Most messaging systems (Kafka, RabbitMQ) provide at-least-once delivery by default. Exactly-once is either unavailable or expensive. The practical approach is at-least-once delivery + idempotent consumers.
Idempotency Keys
You make an endpoint idempotent by having the caller include a unique idempotency key with each request. The server stores the result of the first request keyed by that value; on retries with the same key, it returns the stored result without re-executing the operation.
```
POST /payments
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000

{ "amount": 50, "currency": "USD" }
```
Stripe’s payments API and most financial APIs require an idempotency key on all mutating requests. The key is typically a UUID generated by the client before the first attempt and reused on all retries for the same logical operation.
Implementation Checklist
- Generate the idempotency key client-side (UUID v4) before the first attempt.
- Store (key → response) in a durable store (database, Redis with persistence) with a TTL.
- Return the cached response on duplicate keys without re-executing business logic.
- Scope idempotency keys per user or per API endpoint to avoid cross-request collisions.
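The checklist above can be sketched as a minimal server-side handler. This is an illustration, not production code: the cache is an in-memory dict where a real service needs a durable store, and `charge_customer` is a hypothetical stand-in for the payment processor:

```python
_results = {}   # in production: durable store (DB / persistent Redis) with a TTL
calls = []      # records every real charge, so we can verify "exactly once"

def charge_customer(amount):
    """Hypothetical stand-in for the real payment processor call."""
    calls.append(amount)
    return {"status": "succeeded", "amount": amount}

def handle_payment(idempotency_key, amount):
    cache_key = ("POST /payments", idempotency_key)  # scope keys per endpoint
    if cache_key in _results:
        return _results[cache_key]       # replay: skip the side effect entirely
    response = charge_customer(amount)   # executed at most once per key
    _results[cache_key] = response
    return response

r1 = handle_payment("key-123", 50)
r2 = handle_payment("key-123", 50)       # a client retry of the same request
assert r1 == r2 and len(calls) == 1      # the customer was charged exactly once
```

One subtlety this sketch glosses over: two concurrent retries can both miss the cache, so a real implementation needs an atomic insert-if-absent (e.g. a unique constraint on the key) rather than a plain check-then-write.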
Data Models and Query Languages
(Based on DDIA Chapter 2)
How you model data shapes both the code you write and what questions you can efficiently answer. Three dominant models cover most use cases:
Relational Model
The relational model (SQL) organizes data into tables of rows and columns, with foreign keys representing relationships. It excels at many-to-many relationships, joins, and ad-hoc queries across normalized data. Schema changes can be slow and require careful migration planning.
Use it when: your data is highly interconnected, you need strong transactional guarantees (ACID), or you require complex reporting queries.
Document Model
Document databases (MongoDB, CouchDB) store self-contained, nested documents (JSON or BSON). The document model maps naturally to the objects in application code, avoiding the “impedance mismatch” between OO models and relational tables. It handles one-to-many relationships (e.g., a user’s list of addresses) efficiently because the related data is co-located in the document.
Limitations: documents should not be too large (entire document is read/written on access), and multi-document joins are expensive or unsupported.
Use it when: your data has a natural document structure, you need schema flexibility, or you have high write throughput requirements.
Graph Model
Graph databases (Neo4j, Amazon Neptune) store data as nodes and edges, making it efficient to traverse complex many-to-many relationships. Social networks (who follows whom), recommendation engines (users who bought X also bought Y), and fraud detection (chains of suspicious transactions) all fit the graph model.
Use it when: the relationships between entities are as important as the entities themselves, and you need multi-hop traversal queries.
Choosing a Model
| Criterion | Relational | Document | Graph |
|---|---|---|---|
| Relationship complexity | Many-to-many | One-to-many | Any-to-any (multi-hop) |
| Schema flexibility | Low (strict) | High (schema-on-read) | Medium |
| Join performance | Good (indexed FKs) | Poor or unsupported | Native (edge traversal) |
| Query language | SQL (declarative) | Query API / aggregation | Cypher / Gremlin |
Reliability Patterns
Reliability patterns prevent a failure in one component from cascading into a full system outage.
Circuit Breaker
A circuit breaker wraps calls to a dependency (database, external API). It tracks recent failures and, when failures exceed a threshold, opens the circuit—immediately returning an error instead of attempting the call. After a cooldown period, it enters half-open state and allows a probe request. If the probe succeeds, the circuit closes; if not, it opens again.
Three states: Closed (normal), Open (failing fast), Half-Open (testing recovery).
```
Closed ──(failure threshold exceeded)──▶ Open
Open ──(cooldown expires)──▶ Half-Open
Half-Open ──(probe succeeds)──▶ Closed
Half-Open ──(probe fails)──▶ Open
```
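The state machine above fits in a small class. This is a sketch with illustrative thresholds and an injectable clock (real libraries add sliding windows, metrics, and per-exception policies):

```python
import time

class CircuitBreaker:
    """Closed / Open / Half-Open breaker, per the transitions above."""
    def __init__(self, failure_threshold=3, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock                  # injectable for testing
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown:
                self.state = "half-open"    # cooldown over: allow one probe
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed probe, or too many failures while closed, opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        self.state = "closed"               # success (probe or normal) closes it
        self.failures = 0
        return result
```

While open, callers get an immediate error instead of tying up a thread waiting on a dependency that is known to be down.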
Netflix’s Hystrix popularized this pattern; Resilience4j is the modern Java successor.
Bulkhead
The bulkhead pattern isolates components into separate thread pools (or resource pools) so that a failing component cannot exhaust resources shared by healthy ones. Named after the watertight compartments in a ship’s hull: if one compartment floods, the others stay dry.
Example: Give your payment service calls a dedicated thread pool of 10 threads. Even if payments are completely hung, your user-profile calls still have their own pool and remain responsive.
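In Python this isolation can be sketched with one bounded thread pool per dependency (pool names and sizes here are illustrative):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One bulkhead per dependency: a hung dependency can exhaust only its own pool.
payment_pool = ThreadPoolExecutor(max_workers=10, thread_name_prefix="payments")
profile_pool = ThreadPoolExecutor(max_workers=10, thread_name_prefix="profiles")

def call_payments(fn):
    return payment_pool.submit(fn)   # stuck work queues up here, nowhere else

def call_profiles(fn):
    return profile_pool.submit(fn)

# Simulate every payment call hanging: saturate all 10 payment threads.
hang = threading.Event()
for _ in range(10):
    call_payments(hang.wait)

# Profile calls still have their own threads and stay responsive.
result = call_profiles(lambda: "profile-ok").result(timeout=5)
hang.set()                           # release the stuck payment calls
```

Without the separate pools, the ten hung payment calls would occupy shared worker threads and the profile call would queue behind them.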
Retry with Exponential Backoff
Transient failures (brief network blips, momentary rate limits) can be resolved by retrying. However, retrying immediately at full speed turns a small blip into a thundering herd that can overwhelm a recovering service. Exponential backoff spaces out retries with increasing delays:
```
retry 1: wait 1s
retry 2: wait 2s
retry 3: wait 4s
retry 4: wait 8s
...up to a max cap (e.g., 60s)
```
Add jitter (a random fraction of the delay) to spread retries from multiple clients across time rather than synchronizing them into a spike.
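A common way to combine the two is “full jitter”: pick a uniformly random delay up to the capped exponential bound. A sketch, with a hypothetical retry loop around it (names and defaults are illustrative):

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full jitter: random delay in [0, min(cap, base * 2**attempt)).
    attempt is 0-based, so attempt 0 waits up to 1s, attempt 1 up to 2s, ..."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))

def call_with_retries(fn, max_attempts=5, base=1.0):
    """Retry only errors that are plausibly transient; back off between tries."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise                              # out of attempts: surface it
            time.sleep(backoff_delay(attempt, base=base))
```

Because each client draws its own random delay, a fleet of retrying clients spreads out over the window instead of hammering the recovering service in lockstep.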
Timeout
Always set a timeout on every network call. Without timeouts, a slow dependency can hold your threads indefinitely, eventually starving your thread pool and causing a cascading failure. Set timeouts at two levels:
- Connection timeout: how long to wait to establish the TCP connection.
- Read timeout: how long to wait for a response after the connection is established.
The right timeout value depends on your SLA. A common rule of thumb is to set the read timeout at roughly the p99 latency of the dependency plus a safety margin—not so short that normal slow requests fail, but short enough to fail fast when the dependency is genuinely down.
Combine these patterns: use a circuit breaker that wraps retries with exponential backoff, apply bulkheads to isolate critical dependencies, and always set timeouts on every outbound call. This defense-in-depth approach is how production systems at scale survive partial outages without full service degradation.