Distributed systems split computation and data across multiple machines to achieve scale, fault tolerance, or geographic distribution. That distribution introduces a fundamental set of problems that do not exist on a single node: nodes can crash independently, network links can drop or partition, clocks drift, and messages can arrive out of order or not at all. This page covers the core theoretical constraints—CAP theorem and Raft consensus—and the practical engineering patterns—distributed locks, idempotency, data models, and reliability primitives—that you need to build correct, resilient distributed services.

CAP Theorem

CAP theorem states that a distributed data system can provide at most two of the following three guarantees simultaneously:
| Property | Meaning |
| --- | --- |
| Consistency (C) | Every read receives the most recent write, or an error. All nodes see the same data at the same time. |
| Availability (A) | Every request receives a non-error response, though it may not contain the most recent data. |
| Partition Tolerance (P) | The system continues operating even when network partitions drop or delay messages between nodes. |
Why you can only have two: Consider a primary node that must sync writes to replicas before responding. To guarantee consistency, it must wait for the replicas to acknowledge the write, which can be very slow or impossible during a partition. To guarantee availability, it must respond even before every replica has confirmed, but then a reader hitting an unsynchronized replica sees stale data. You cannot have both strong guarantees at once across an unreliable network.

For distributed systems, P is non-negotiable: network partitions happen in any real multi-node deployment. This means every distributed system is really choosing between CP and AP:
  • CP systems prioritize data correctness and will refuse or delay responses during a partition to avoid returning inconsistent data. Examples: etcd, ZooKeeper, distributed SQL databases (CockroachDB, TiDB). These are used for coordination, configuration, and leader election where stale data is dangerous.
  • AP systems prioritize uptime and respond quickly even if the data might be slightly stale. Examples: Redis (with replication), Cassandra, DynamoDB. These are used for caches, shopping carts, and feeds where a brief inconsistency is acceptable.
Choosing CP or AP does not mean completely abandoning the other property. In practice, you downgrade the non-chosen property rather than eliminate it. A CP system still aims for five-nines availability; an AP system uses techniques like read-repair and anti-entropy to minimize inconsistency windows.

Raft Consensus Algorithm

Raft is a consensus algorithm designed to be more understandable than Paxos while providing equivalent guarantees. It is used in production systems including etcd (the backing store for Kubernetes), CockroachDB, and TiKV.

What Raft Solves

Raft solves the replicated log problem: how do you keep a log of commands identical across N servers even when some servers crash or messages are delayed? A consistent replicated log lets you build any fault-tolerant state machine—configuration stores, coordination services, or distributed databases.

Node Roles

At any moment, every Raft node is in exactly one of three states:
| Role | Responsibility |
| --- | --- |
| Leader | There is exactly one leader per term. The leader accepts client writes, appends them to its log, and replicates them to followers. |
| Follower | Passively receives log entries from the leader and votes in elections. |
| Candidate | A follower that has timed out waiting for the leader and is campaigning for election. |

Leader Election

Raft uses a heartbeat mechanism to trigger elections. When a follower receives no heartbeat within a randomized election timeout (typically 150–300 ms), it increments its term number, converts to candidate, and requests votes from all other nodes. A candidate wins if it receives votes from a majority of nodes. Each node votes at most once per term, on a first-come, first-served basis. The winner immediately sends heartbeats to all followers, establishing its authority for the new term.

Handling split votes: If no candidate gets a majority (a “split vote”), every candidate times out and starts a new election with an incremented term. Raft uses randomized timeouts so that one candidate usually starts the next election before the others do, breaking the tie quickly. This approach proved more reliable in practice than priority-based schemes. (A split vote is not split brain: the majority requirement guarantees at most one leader per term.)

Election safety constraint: A candidate only receives a vote if its log is at least as up-to-date as the voter’s log, compared by the term and then the index of the last log entry. This prevents a node with stale data from becoming leader and overwriting committed entries.
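The up-to-date comparison can be made concrete with a small sketch (Python, illustrative only; the argument names are assumptions, not fields from any particular implementation):

def candidate_is_up_to_date(candidate_last_term, candidate_last_index,
                            voter_last_term, voter_last_index):
    # Compare the term of each log's last entry first; if the terms are
    # equal, the longer log (higher last index) is the more up-to-date one.
    if candidate_last_term != voter_last_term:
        return candidate_last_term > voter_last_term
    return candidate_last_index >= voter_last_index

# A voter grants its vote only if this returns True and it has not already
# voted for a different candidate in the same term.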

Log Replication

Once elected, the leader:
  1. Accepts a client request and appends it as a new entry to its own log.
  2. Sends AppendEntries RPCs in parallel to all followers.
  3. Waits for a majority of followers to acknowledge the entry.
  4. Marks the entry as committed and applies it to the state machine.
  5. Notifies followers of the commit in the next heartbeat so they apply the entry too.
If a follower’s log diverges from the leader’s (due to a previous leader crashing mid-replication), the leader forces the follower to overwrite conflicting entries by finding the last point of agreement and resending from there. The leader’s log always wins; leaders never rewrite their own entries.
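Steps 3 and 4 boil down to a majority count over acknowledged log indexes. Here is a minimal sketch of that commit rule (Python, illustrative; the names are assumptions, and real Raft additionally requires the entry at the chosen index to come from the leader's current term):

def new_commit_index(follower_match_index, leader_last_index):
    # follower_match_index[i] is the highest log index known to be
    # replicated on follower i. Include the leader's own log, sort
    # descending, and take the highest index stored on a majority of
    # all servers (leader included).
    indexes = sorted(follower_match_index + [leader_last_index], reverse=True)
    majority = len(indexes) // 2 + 1
    return indexes[majority - 1]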

Why Raft over Paxos?

The authors of Raft’s original paper (In Search of an Understandable Consensus Algorithm) designed it explicitly for understandability. Paxos is notoriously difficult to implement correctly because the paper describes only single-decree consensus (agreement on one value), leaving the multi-decree extension (a replicated log) largely as an exercise. Raft decomposes the problem into leader election, log replication, and safety, each with clear invariants, making it far easier to implement correctly and reason about.

Distributed Locks

In a multi-process or multi-machine environment, local OS mutexes protect shared state only within a single process. When multiple service instances share a resource—a database row, a file, a rate-limit counter—you need a lock backed by an external system all instances can reach.

Implementation Options

| Approach | Mechanism | Pros | Cons |
| --- | --- | --- | --- |
| Database row lock | Insert/update a “lock” row; use DB transactions for atomicity | Simple, uses existing infrastructure | Poor performance under high contention; no auto-expiry without extra columns |
| Redis SETNX | SET lock_key uuid NX PX ttl | Very fast; TTL prevents deadlocks | Risk of lock loss on Redis primary failover; requires Redlock for multi-node safety |
| ZooKeeper ephemeral nodes | Create a sequential ephemeral node; the client with the smallest sequence number holds the lock | Strong consistency; auto-releases when client disconnects | Complex to operate; lower throughput than Redis |
| etcd lease | Acquire a key with a lease; renew via keepalive | Strong consistency (CP); lease auto-expires on crash | Requires etcd cluster; sensitive to network latency |

Redis SETNX: Known Problems and Solutions

A naive Redis lock using SETNX has two failure modes:
  1. Lock expiry before the holder finishes: A slow thread holds the lock past its TTL; a second thread acquires it; both are now in the critical section simultaneously.
  2. Wrong owner releases the lock: Thread A’s lock expires, Thread B acquires it, then Thread A completes and deletes Thread B’s lock key.
Watchdog (heartbeat renewal): A background thread periodically checks whether the current process still holds the lock and extends the TTL if so. This prevents expiry on slow-but-alive holders.

Owner identity check: Store the lock value as a UUID unique to the lock acquisition. Before deleting, verify the stored value matches your UUID. Use a Lua script to make the check-and-delete atomic:
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
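Putting it together, here is a minimal sketch of acquire and release with the redis-py client and the script above (the key name, TTL, and lack of retries are illustrative simplifications):

import uuid
import redis

r = redis.Redis()

RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""
release_script = r.register_script(RELEASE_SCRIPT)

def acquire(lock_key, ttl_ms=30_000):
    token = str(uuid.uuid4())
    # SET key value NX PX ttl: succeeds only if the key does not exist yet.
    if r.set(lock_key, token, nx=True, px=ttl_ms):
        return token   # caller keeps the token to release the lock later
    return None        # someone else holds the lock

def release(lock_key, token):
    # Atomic check-and-delete: only the holder's token can delete the key.
    return release_script(keys=[lock_key], args=[token]) == 1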
Redisson (Java): The Redisson library implements all of the above—watchdog renewal, owner-based release, and re-entrancy (tracking a lock holder and reentry count in a Redis hash)—so you don’t have to build it yourself.

etcd Lease-Based Lock

etcd’s lease mechanism makes distributed locking straightforward:
# Acquire a lock named "mutex1"; blocks if already held
etcdctl --endpoints=$ENDPOINTS lock mutex1
Programmatically, you create a lease with a TTL, associate the lock key with that lease, and call KeepAlive to renew it. When your process crashes, the lease expires and the lock is automatically released. Because etcd is a Raft-backed CP store, the lock service itself never grants the same lock to two clients at once, so there is no split-brain at the lock level.
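A hedged sketch of the programmatic flow, assuming the third-party python-etcd3 client is installed and pointed at your cluster (the lock name and TTL are illustrative, and whether the library auto-renews the lease while the block runs should be verified):

import etcd3  # assumption: the python-etcd3 package is available

client = etcd3.client(host="127.0.0.1", port=2379)

# The lock key is tied to a lease with the given TTL; if this process dies
# without releasing, the lease expires and the lock frees itself.
with client.lock("mutex1", ttl=10):
    do_critical_section_work()  # hypothetical stand-in for your own logic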

Idempotency

In distributed systems, network failures mean that a caller often cannot tell whether a request was received and processed. The safe response is to retry—but retrying a non-idempotent operation (like “charge the customer $50”) can cause duplicate side effects. Idempotency means that performing an operation multiple times has the same effect as performing it once.

Why It Matters: Delivery Semantics

| Guarantee | Meaning | Risk |
| --- | --- | --- |
| At-most-once | Send once; never retry | Messages may be lost |
| At-least-once | Retry until acknowledged | Duplicates possible |
| Exactly-once | Deliver precisely once | Hardest to achieve; requires idempotency on the receiver |
Most messaging systems (Kafka, RabbitMQ) provide at-least-once delivery by default. Exactly-once is either unavailable or expensive. The practical approach is at-least-once delivery + idempotent consumers.

Idempotency Keys

You make an endpoint idempotent by having the caller include a unique idempotency key with each request. The server stores the result of the first request keyed by that value; on retries with the same key, it returns the stored result without re-executing the operation.
POST /payments
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
{ "amount": 50, "currency": "USD" }
Stripe’s payments API and most financial APIs require an idempotency key on all mutating requests. The key is typically a UUID generated by the client before the first attempt and reused on all retries for the same logical operation.

Implementation Checklist

  • Generate the idempotency key client-side (UUID v4) before the first attempt.
  • Store (key → response) in a durable store (database, Redis with persistence) with a TTL.
  • Return the cached response on duplicate keys without re-executing business logic.
  • Scope idempotency keys per user or per API endpoint to avoid cross-request collisions.
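A minimal sketch of the server-side bookkeeping, assuming redis-py as the durable key-to-response store (the key scheme, TTL, and business-logic function are illustrative):

import json
import redis

r = redis.Redis()
RESULT_TTL = 24 * 3600  # keep stored responses for 24 hours (illustrative)

def handle_payment(user_id, idempotency_key, request_body):
    # Scope the stored key per user so different callers cannot collide.
    store_key = f"idem:{user_id}:{idempotency_key}"
    cached = r.get(store_key)
    if cached is not None:
        # Retry of an already-processed request: return the saved response
        # without charging again.
        return json.loads(cached)
    response = charge_customer(request_body)  # hypothetical business logic
    # NX means only the first writer's response is kept if two retries race.
    r.set(store_key, json.dumps(response), nx=True, ex=RESULT_TTL)
    return response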

Data Models and Query Languages

(Based on Designing Data-Intensive Applications, Chapter 2.) How you model data shapes both the code you write and what questions you can efficiently answer. Three dominant models cover most use cases:

Relational Model

The relational model (SQL) organizes data into tables of rows and columns, with foreign keys representing relationships. It excels at many-to-many relationships, joins, and ad-hoc queries across normalized data. Schema changes can be slow and require careful migration planning. Use it when: your data is highly interconnected, you need strong transactional guarantees (ACID), or you require complex reporting queries.

Document Model

Document databases (MongoDB, CouchDB) store self-contained, nested documents (JSON or BSON). The document model maps naturally to the objects in application code, avoiding the “impedance mismatch” between OO models and relational tables. It handles one-to-many relationships (e.g., a user’s list of addresses) efficiently because the related data is co-located in the document. Limitations: documents should not be too large (entire document is read/written on access), and multi-document joins are expensive or unsupported. Use it when: your data has a natural document structure, you need schema flexibility, or you have high write throughput requirements.
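For instance, a user profile and its one-to-many addresses can live in one self-contained document (shown here as a Python dict purely for illustration; the field names are made up):

user_doc = {
    "_id": "user_123",
    "name": "Ada Lovelace",
    # The one-to-many addresses are embedded, so reading the profile
    # returns everything in a single lookup with no join.
    "addresses": [
        {"type": "home", "city": "London"},
        {"type": "work", "city": "Cambridge"},
    ],
}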

Graph Model

Graph databases (Neo4j, Amazon Neptune) store data as nodes and edges, making it efficient to traverse complex many-to-many relationships. Social networks (who follows whom), recommendation engines (users who bought X also bought Y), and fraud detection (chains of suspicious transactions) all fit the graph model. Use it when: the relationships between entities are as important as the entities themselves, and you need multi-hop traversal queries.

Choosing a Model

| Criterion | Relational | Document | Graph |
| --- | --- | --- | --- |
| Relationship complexity | Many-to-many | One-to-many | Any-to-any (multi-hop) |
| Schema flexibility | Low (strict) | High (schema-on-read) | Medium |
| Join performance | Good (indexed FKs) | Poor | Native |
| Query language | SQL (declarative) | Query API / aggregation | Cypher / Gremlin |

Reliability Patterns

Reliability patterns prevent a failure in one component from cascading into a full system outage.

Circuit Breaker

A circuit breaker wraps calls to a dependency (database, external API). It tracks recent failures and, when failures exceed a threshold, opens the circuit—immediately returning an error instead of attempting the call. After a cooldown period, it enters half-open state and allows a probe request. If the probe succeeds, the circuit closes; if not, it opens again. Three states: Closed (normal), Open (failing fast), Half-Open (testing recovery).
Closed ──(failure threshold exceeded)──▶ Open
Open ──(cooldown expires)──▶ Half-Open
Half-Open ──(probe succeeds)──▶ Closed
Half-Open ──(probe fails)──▶ Open
Netflix’s Hystrix popularized this pattern; Resilience4j is the modern Java successor.
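A minimal single-threaded sketch of that state machine (Python; the threshold, cooldown, and blanket exception handling are illustrative simplifications, and production libraries also handle concurrency and metrics):

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                raise RuntimeError("circuit open: failing fast")
            self.state = "half-open"  # cooldown expired: allow one probe call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"   # probe failed or threshold exceeded
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "closed"         # any success closes the circuit
        return result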

Bulkhead

The bulkhead pattern isolates components into separate thread pools (or resource pools) so that a failing component cannot exhaust resources shared by healthy ones. Named after the watertight compartments in a ship’s hull: if one compartment floods, the others stay dry. Example: Give your payment service calls a dedicated thread pool of 10 threads. Even if payments are completely hung, your user-profile calls still have their own pool and remain responsive.
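A sketch of the thread-pool variant using Python's concurrent.futures (pool sizes, timeouts, and the called functions are illustrative assumptions):

from concurrent.futures import ThreadPoolExecutor

# Separate pools: a hung payments dependency can tie up at most its own
# 10 threads, leaving the profile pool untouched.
payments_pool = ThreadPoolExecutor(max_workers=10, thread_name_prefix="payments")
profile_pool = ThreadPoolExecutor(max_workers=10, thread_name_prefix="profile")

def charge(payment):
    # submit() returns a Future; result(timeout=...) bounds the wait.
    return payments_pool.submit(call_payment_api, payment).result(timeout=5)

def fetch_profile(user_id):
    return profile_pool.submit(load_profile, user_id).result(timeout=2)

Here call_payment_api and load_profile are hypothetical stand-ins for your own dependency calls.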

Retry with Exponential Backoff

Transient failures (brief network blips, momentary rate limits) can be resolved by retrying. However, retrying immediately at full speed turns a small blip into a thundering herd that can overwhelm a recovering service. Exponential backoff spaces out retries with increasing delays:
retry 1: wait  1s
retry 2: wait  2s
retry 3: wait  4s
retry 4: wait  8s
...up to a max cap (e.g., 60s)
Add jitter (a random fraction of the delay) to spread retries from multiple clients across time rather than synchronizing them into a spike.
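A minimal sketch combining the capped exponential schedule with full jitter (Python; the attempt limit and delays are illustrative):

import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(base_delay * 2 ** attempt, max_delay)
            # Full jitter: sleep a random fraction of the capped delay so
            # many clients do not retry in lockstep.
            time.sleep(random.uniform(0, delay))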

Timeout

Always set a timeout on every network call. Without timeouts, a slow dependency can hold your threads indefinitely, eventually starving your thread pool and causing a cascading failure. Set timeouts at two levels:
  • Connection timeout: how long to wait to establish the TCP connection.
  • Read timeout: how long to wait for a response after the connection is established.
The right timeout value depends on your SLA. A common rule of thumb is to set the read timeout at roughly the p99 latency of the dependency plus a safety margin—not so short that normal slow requests fail, but short enough to fail fast when the dependency is genuinely down.
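As an example, the Python requests library accepts a (connect, read) timeout pair; the URL and values below are illustrative only:

import requests

# 2 s to establish the TCP connection, 5 s to wait for the response.
# Without a timeout argument, requests will wait indefinitely.
resp = requests.get("https://payments.internal/charge-status",  # hypothetical URL
                    timeout=(2.0, 5.0))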
Combine these patterns: use a circuit breaker that wraps retries with exponential backoff, apply bulkheads to isolate critical dependencies, and always set timeouts on every outbound call. This defense-in-depth approach is how production systems at scale survive partial outages without full service degradation.