Modern backend systems decompose monolithic applications into independently deployable services that communicate over the network. This decomposition buys you independent scaling, fault isolation, and the freedom to use different technology stacks per service—but it also introduces real complexity: How do services find each other? How do you coordinate a transaction that spans multiple databases? What happens when the network partitions? This page covers the architectural primitives and concrete tools that answer those questions: gRPC, etcd, the CAP theorem, distributed transactions, the Spring Cloud ecosystem, and distributed locking.

gRPC vs REST

gRPC and REST are the two standard approaches to request/response communication between services. Choosing between them depends on your audience (internal vs. external) and on your performance requirements.

Comparison table

| Dimension | gRPC | REST |
| --- | --- | --- |
| Schema / contract | .proto file — strict, version-controlled | OpenAPI / ad hoc — flexible but inconsistent |
| Encoding | Protocol Buffers (binary, compact) | JSON (text, human-readable) |
| Transport | HTTP/2 (multiplexed, streaming) | HTTP/1.1 (one request at a time per connection) |
| Streaming | Unary, client-streaming, server-streaming, bidirectional | Not natively supported |
| Code generation | Stubs generated from .proto for any supported language | Manual or tooling-dependent |
| Browser support | Limited (requires a gRPC-Web proxy) | First-class |
| Debugging | Harder (binary payload) | Easier (readable JSON) |
| Best fit | Internal service-to-service calls | Public APIs, browser clients |

When to use each

Use gRPC for internal service-to-service communication where you control both ends. The binary encoding is smaller on the wire, HTTP/2 multiplexing reduces connection overhead, and generated stubs eliminate entire classes of integration bugs. Use REST when exposing a public API, integrating with external partners, or serving browser clients. JSON is universally understood and easy to debug with standard tools.

Protobuf basics

Every gRPC service is defined in a .proto file. The compiler (protoc) generates client stubs and server interfaces for your target language:
syntax = "proto3";

package user;

service UserService {
  rpc GetUser (GetUserRequest) returns (UserResponse);
  rpc ListUsers (ListUsersRequest) returns (stream UserResponse);
}

message GetUserRequest {
  int64 user_id = 1;
}

message UserResponse {
  int64 id     = 1;
  string name  = 2;
  string email = 3;
}

message ListUsersRequest {
  int32 page      = 1;
  int32 page_size = 2;
}
The gRPC communication flow is:
  1. Define the API in a .proto file.
  2. Compile it to stub files for each language (Go, Java, Python, etc.).
  3. The server implements the generated interface and starts listening.
  4. The client calls stub methods as if they were local functions; gRPC handles serialization and transport.
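Step 2 in concrete terms for Go: a sketch of the protoc invocation, assuming the protoc-gen-go and protoc-gen-go-grpc plugins are on your PATH and that the .proto file declares an `option go_package`, which the Go generator requires:

```shell
# generates user.pb.go (message types) and user_grpc.pb.go (client stub + server interface)
protoc --go_out=. --go_opt=paths=source_relative \
       --go-grpc_out=. --go-grpc_opt=paths=source_relative \
       user.proto
```

Other languages follow the same pattern with their own plugins (e.g. --java_out, --python_out).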

Service Discovery with etcd

etcd is a distributed, strongly consistent key-value store written in Go. It uses the Raft consensus algorithm to replicate data across a cluster of nodes. Common use cases include service registration and discovery, distributed locks, configuration sharing, and leader election. etcd’s architecture is layered:
  • Client layer — SDK with built-in load balancing and automatic failover.
  • API network layer — v3 API uses gRPC; v3 also exposes an HTTP/1.x gateway for non-gRPC clients.
  • Raft layer — handles leader election, log replication, and consistency guarantees.
  • Logic layer — KV store, MVCC, leases, authentication.
  • Storage layer — write-ahead log (WAL) for crash safety, boltdb for persistent data.

Basic etcdctl operations

export ETCDCTL_API=3
ENDPOINTS=192.168.1.10:2379

# put and get
etcdctl --endpoints=$ENDPOINTS put /services/auth "http://10.0.0.1:8080"
etcdctl --endpoints=$ENDPOINTS get /services/auth

# prefix scan — retrieve all keys under a namespace
etcdctl --endpoints=$ENDPOINTS put /services/cart "http://10.0.0.2:8080"
etcdctl --endpoints=$ENDPOINTS put /services/order "http://10.0.0.3:8080"
etcdctl --endpoints=$ENDPOINTS get /services/ --prefix

# watch for changes
etcdctl --endpoints=$ENDPOINTS watch /services/ --prefix

# leases — auto-expire a key after N seconds (used for service heartbeats)
etcdctl --endpoints=$ENDPOINTS lease grant 30
# prints the granted lease ID, e.g. "lease 694d77aa9e38260b granted with TTL(30s)"
etcdctl --endpoints=$ENDPOINTS put /services/auth "http://10.0.0.1:8080" --lease=694d77aa9e38260b
etcdctl --endpoints=$ENDPOINTS lease keep-alive 694d77aa9e38260b

Service registration in Go

The following pattern registers a service instance under a lease. If the instance dies, the lease expires and the key is automatically removed:
// ServiceRegister bundles the etcd client with the lease used for registration.
type ServiceRegister struct {
    client     *clientv3.Client
    kv         clientv3.KV
    lease      clientv3.Lease
    cancelFunc context.CancelFunc
}

func InitService(endpoints []string, ttl int64) (*ServiceRegister, error) {
    client, err := clientv3.New(clientv3.Config{
        Endpoints:   endpoints,
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        return nil, err
    }
    return &ServiceRegister{
        client: client,
        kv:     clientv3.NewKV(client),
        lease:  clientv3.NewLease(client),
    }, nil
}

func (s *ServiceRegister) Register(key, addr string, ttl int64) error {
    // create a lease
    resp, err := s.lease.Grant(context.TODO(), ttl)
    if err != nil {
        return err
    }
    // bind the key to the lease
    _, err = s.kv.Put(context.TODO(), key, addr,
        clientv3.WithLease(resp.ID))
    if err != nil {
        return err
    }
    // keep the lease alive automatically
    ctx, cancel := context.WithCancel(context.TODO())
    s.cancelFunc = cancel
    ch, err := s.lease.KeepAlive(ctx, resp.ID)
    if err != nil {
        return err
    }
    // the keep-alive response channel must be drained, or it fills up
    // and renewals stop
    go func() {
        for range ch {
        }
    }()
    return nil
}

Service discovery in Go

The discovery side watches a key prefix and maintains a local map of available instances:
func (c *Client) GetService(prefix string) ([]string, error) {
    resp, err := c.kv.Get(context.Background(), prefix,
        clientv3.WithPrefix())
    if err != nil {
        return nil, err
    }
    addrs := make([]string, 0, len(resp.Kvs))
    for _, kv := range resp.Kvs {
        c.serverList[string(kv.Key)] = string(kv.Value) // seed the local instance map
        addrs = append(addrs, string(kv.Value))
    }
    go c.watch(prefix) // stay up to date
    return addrs, nil
}

func (c *Client) watch(prefix string) {
    ch := c.watcher.Watch(context.Background(), prefix,
        clientv3.WithPrefix())
    for resp := range ch {
        for _, ev := range resp.Events {
            switch ev.Type {
            case mvccpb.PUT:
                c.serverList[string(ev.Kv.Key)] = string(ev.Kv.Value)
            case mvccpb.DELETE:
                delete(c.serverList, string(ev.Kv.Key))
            }
        }
    }
}

Raft consensus overview

etcd’s consistency guarantee comes from the Raft algorithm. Every write goes through a single leader, which replicates the log entry to a majority of followers before acknowledging the client. Key properties:
  • Leader election — followers start an election if they don’t hear from the leader within a randomized timeout. A candidate wins by collecting votes from a majority of the cluster.
  • Log replication — the leader appends the client’s command to its log and sends it to all followers in parallel. Once a majority acknowledge, the entry is committed and applied.
  • Safety — a candidate can only win an election if its log is at least as up-to-date as the logs of a majority of voters, preventing stale data from becoming authoritative.
  • Random election timeouts — these prevent split-vote deadlocks by ensuring followers start elections at different times.
A five-node cluster tolerates two failures. etcd’s Raft implementation makes it a CP system: it prioritizes consistency over availability when the cluster cannot reach a quorum.
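The fault-tolerance arithmetic behind "a five-node cluster tolerates two failures" is simply that a quorum is a strict majority, n/2 + 1, so an n-node cluster survives (n-1)/2 failures. A quick sketch:

```go
package main

import "fmt"

// quorum returns the majority size for an n-node Raft cluster.
func quorum(n int) int { return n/2 + 1 }

// faults returns how many node failures the cluster tolerates
// while still being able to reach a quorum.
func faults(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("%d nodes: quorum=%d, tolerated failures=%d\n", n, quorum(n), faults(n))
	}
}
```

This is also why etcd clusters use odd sizes: a four-node cluster needs a quorum of three and still tolerates only one failure, the same as three nodes.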

CAP Theorem in Practice

The CAP theorem states that a distributed system can guarantee at most two of the following three properties simultaneously:
  • C — Consistency: every read returns the most recently written value or an error.
  • A — Availability: every request receives a (possibly stale) response; the system never refuses.
  • P — Partition tolerance: the system keeps operating even when network partitions split nodes into groups that cannot communicate.
Because network partitions are unavoidable in any real distributed system, partition tolerance is non-negotiable. Your actual choice is between CP and AP:
| Type | Behavior | Examples |
| --- | --- | --- |
| CP | Rejects or delays requests when it cannot guarantee consistency | etcd, ZooKeeper, distributed relational DBs |
| AP | Returns potentially stale data rather than refusing | Redis, Cassandra, DynamoDB |

Real-world tradeoffs

CP systems (like etcd) make writes wait for a majority of nodes to acknowledge. If the cluster loses quorum, writes fail. This is the right choice for service discovery, leader election, and configuration that must be correct. AP systems (like Redis) return the latest value a node has, even if replication hasn’t caught up. This is the right choice for caches, session stores, and real-time counters where a brief inconsistency is acceptable. Choosing C or A doesn’t mean completely abandoning the other property. A CP system still tries to serve reads from the leader as fast as possible; an AP system still replicates writes in the background. It’s a sliding scale, not a binary switch.

Distributed Transactions

When a business operation spans multiple services with independent databases, you cannot use a local ACID transaction. Two common approaches handle this:

Two-phase commit (2PC)

A coordinator asks all participants to prepare (lock resources and guarantee they can commit), then—once all confirm—sends a commit message to all. Strengths: strong consistency, atomicity across participants. Weaknesses: the coordinator is a single point of failure; if it crashes after prepare but before commit, participants are stuck holding locks indefinitely. High latency due to two round-trips. Not suitable for microservices at scale.
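The coordinator logic described above can be sketched in a few lines of Go. This is a toy, in-process model with a hypothetical Participant interface (real 2PC participants are resource managers speaking a protocol such as XA, and a real coordinator must also persist its decision to survive crashes):

```go
package main

import (
	"errors"
	"fmt"
)

// Participant is a hypothetical resource-manager interface for this sketch.
type Participant interface {
	Prepare() error // phase 1: lock resources and promise to commit
	Commit()        // phase 2: make the change durable
	Rollback()      // undo a successful prepare
}

// twoPhaseCommit: if any Prepare fails, roll back everyone who already
// prepared; only when all participants prepare does anyone commit.
func twoPhaseCommit(parts []Participant) error {
	var prepared []Participant
	for _, p := range parts {
		if err := p.Prepare(); err != nil {
			for _, q := range prepared {
				q.Rollback()
			}
			return fmt.Errorf("transaction aborted: %w", err)
		}
		prepared = append(prepared, p)
	}
	for _, p := range prepared {
		p.Commit()
	}
	return nil
}

// db is a fake participant used only to demonstrate the flow.
type db struct {
	name  string
	fail  bool
	state string
}

func (d *db) Prepare() error {
	if d.fail {
		return errors.New(d.name + " cannot prepare")
	}
	d.state = "prepared"
	return nil
}
func (d *db) Commit()   { d.state = "committed" }
func (d *db) Rollback() { d.state = "rolled back" }

func main() {
	orders, stock := &db{name: "orders"}, &db{name: "stock", fail: true}
	err := twoPhaseCommit([]Participant{orders, stock})
	fmt.Println(err, "| orders:", orders.state)
}
```

The coordinator's weakness is visible here: between the Prepare loop and the Commit loop, a crash leaves participants holding locks with no decision recorded.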

SAGA pattern

A SAGA is a sequence of local transactions, each with a compensating transaction that can undo its effect. If step N fails, the system executes the compensating transactions for steps N-1, N-2, … 1 in reverse order.
Order placed → Payment reserved → Inventory decremented → Shipping scheduled
     ↑               ↑                   ↑                        ↑
Cancel order   Refund payment    Restore inventory          Cancel shipment
(compensation) (compensation)     (compensation)            (compensation)
Strengths: each step is a local transaction—no distributed locking. Services remain loosely coupled. Works asynchronously with message queues. Weaknesses: intermediate states are visible (eventual consistency, not strong consistency). Compensating transactions can be complex to design correctly.

Eventual consistency

Most microservice systems accept eventual consistency for non-critical data. A write propagates to other services asynchronously via events or message queues. Services design their reads to tolerate brief staleness, and idempotent operations ensure that re-delivered events don’t cause double-writes.
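Idempotency is usually enforced by recording processed event IDs and skipping duplicates. A minimal in-memory sketch (the Consumer type is hypothetical; real systems persist the processed-ID set in the consumer's own database, inside the same transaction as the write):

```go
package main

import "fmt"

// Consumer applies each event at most once by remembering processed IDs.
type Consumer struct {
	processed map[string]bool
	balance   int
}

func NewConsumer() *Consumer {
	return &Consumer{processed: map[string]bool{}}
}

// Handle credits the balance once per event ID, no matter how many
// times the message queue redelivers the same event.
func (c *Consumer) Handle(eventID string, amount int) {
	if c.processed[eventID] {
		return // duplicate delivery: ignore
	}
	c.processed[eventID] = true
	c.balance += amount
}

func main() {
	c := NewConsumer()
	c.Handle("evt-1", 100)
	c.Handle("evt-1", 100) // redelivered by the broker
	c.Handle("evt-2", 50)
	fmt.Println(c.balance) // 150, not 250
}
```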

Spring Cloud Stack

Spring Cloud provides production-ready implementations of the most common microservice patterns for Java services.

Service discovery: Nacos

Nacos is a popular service registry and configuration center widely used with SpringCloud. Each service registers itself on startup and deregisters on shutdown. Consumers query Nacos for the instance list and load-balance locally.
<dependency>
    <groupId>com.alibaba.cloud</groupId>
    <artifactId>spring-cloud-starter-alibaba-nacos-discovery</artifactId>
</dependency>
spring:
  cloud:
    nacos:
      server-addr: 192.168.1.100:8848
Nacos also supports config management: shared configuration is pushed to services at runtime without a restart—changes propagate automatically via a long-poll mechanism.

OpenFeign for HTTP service calls

OpenFeign generates HTTP client code from an annotated Java interface, eliminating manual RestTemplate wiring:
@FeignClient("item-service")
public interface ItemClient {
    @GetMapping("/items")
    List<ItemDTO> queryItemByIds(@RequestParam("ids") Collection<Long> ids);
}
Enable connection pooling with OkHttp to handle high concurrency:
<dependency>
    <groupId>io.github.openfeign</groupId>
    <artifactId>feign-okhttp</artifactId>
</dependency>
feign:
  okhttp:
    enabled: true

API Gateway: Spring Cloud Gateway

The gateway is a single entry point that routes requests to the appropriate microservice, enforces authentication, rate-limits traffic, and rewrites paths.
spring:
  cloud:
    gateway:
      routes:
        - id: item
          uri: lb://item-service       # lb = load-balanced via Nacos
          predicates:
            - Path=/items/**,/search/**
        - id: cart
          uri: lb://cart-service
          predicates:
            - Path=/carts/**

Circuit breakers

A circuit breaker monitors failures on an outbound call. If the failure rate exceeds a threshold, the circuit opens and subsequent calls fail fast (or return a fallback) without waiting for the downstream timeout. This prevents one slow or failing service from cascading failures across the system. Spring Cloud integrates with Resilience4j for circuit-breaker, rate-limiter, and retry policies.

Distributed Locks

In a single-process application you use synchronized or ReentrantLock. In a distributed system those primitives are local to one JVM instance. You need a lock whose state is visible to all instances.

Redis-based locks (SETNX)

Redis’s SET key value NX PX milliseconds command sets a key only if it does not exist, with an expiry time. This implements a basic distributed lock:
SET lock:resource uuid NX PX 10000
# NX = only set if not exists; PX = expire in ms
Problems with naive SETNX:
  1. Lock expiry under load — if the holder takes longer than the TTL, the lock expires, another instance acquires it, and the original holder still thinks it owns the lock.
  2. Wrong owner release — if instance A’s lock expires and instance B acquires it, then A finishes and calls DEL, it deletes B’s lock.
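Problem 2 is fixed by checking ownership before deleting. The sketch below uses an in-memory stand-in for Redis so it is runnable; in real Redis the check-and-delete must execute atomically server-side (conventionally as a Lua script), because a client-side GET followed by DEL is itself a race:

```go
package main

import (
	"fmt"
	"sync"
)

// mockRedis is a stand-in for a Redis instance, for illustration only.
type mockRedis struct {
	mu   sync.Mutex
	keys map[string]string
}

// SetNX mirrors SET key value NX: set only if the key is absent.
func (r *mockRedis) SetNX(key, val string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, exists := r.keys[key]; exists {
		return false
	}
	r.keys[key] = val
	return true
}

// ReleaseIfOwner deletes the lock only when the stored value matches
// the caller's token, so instance A can never delete B's lock.
func (r *mockRedis) ReleaseIfOwner(key, token string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.keys[key] != token {
		return false
	}
	delete(r.keys, key)
	return true
}

func main() {
	r := &mockRedis{keys: map[string]string{}}
	r.SetNX("lock:resource", "uuid-A")
	fmt.Println(r.ReleaseIfOwner("lock:resource", "uuid-B")) // false: wrong owner
	fmt.Println(r.ReleaseIfOwner("lock:resource", "uuid-A")) // true: owner releases
}
```

Problem 1 (expiry under load) needs lease renewal on top of this, which is exactly what Redisson's watchdog provides.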

Redisson (production-ready Redis locking)

Redisson addresses both problems:
  • Watchdog / auto-renewal — a background task extends the lease every 10 seconds (one-third of the default 30-second watchdog timeout) as long as the holder is still running.
  • Lock identity — the lock value is UUID + threadId. Only the exact owner can release it.
  • Reentrant support — the lock tracks a reentry count so the same thread can lock again without deadlocking.
RLock lock = redissonClient.getLock("order:lock:" + orderId);
lock.lock();
try {
    // critical section
} finally {
    lock.unlock();
}

etcd-based locks

etcd’s lease mechanism provides a naturally expiring distributed lock:
etcdctl --endpoints=$ENDPOINTS lock mutex1
# blocks until the lock is acquired; Ctrl+C releases it
In code, the etcd client's concurrency package creates a key under a shared prefix, bound to the session's lease so it disappears if the holder dies. The key with the lowest create revision holds the lock; the others watch the key immediately ahead of them and acquire when it is deleted. When to use each:
| Dimension | Redis (Redisson) | etcd |
| --- | --- | --- |
| Performance | Very high (sub-millisecond) | Moderate |
| Consistency | AP (stronger with Redlock across replicas) | CP (Raft-backed) |
| Setup | Simple | Requires an etcd cluster |
| Best for | High-throughput locks (flash sales, rate limiting) | Infrastructure-level locks (leader election, config writes) |