System design interviews test your ability to think at scale: to start from an ambiguous requirement, ask the right clarifying questions, make explicit tradeoffs, and produce an architecture you can defend. Unlike algorithm interviews, there is no single correct answer — the interviewer is evaluating your design process as much as the specific choices you make. This page gives you a repeatable framework, a toolkit of scalability patterns, and worked examples for the most commonly asked system design problems.

The Design Framework

Follow these four steps in every system design interview. Spend your time roughly as shown — interviewers expect you to reach a working design, not spend fifteen minutes on requirements.
1. Clarify requirements (5 minutes)

Never assume you understand the requirements. Ask about:
  • Functional requirements: What does the system actually do? What features are in scope for this session?
  • Non-functional requirements: What are the latency targets (p99 < 100 ms?), availability expectations (99.9%? 99.99%?), and consistency requirements (eventual vs. strong)?
  • Scale: How many users? Daily active users vs. total users? Read-heavy or write-heavy? What are the peak QPS expectations?
  • Constraints: Any existing infrastructure to integrate with? Budget or technology constraints?
Document your assumptions explicitly. Saying “I’ll assume 100 million registered users with 10 million daily active users” shows structured thinking and gives the interviewer a chance to correct you early.
2. Estimate scale (5 minutes)

Back-of-the-envelope numbers guide every subsequent decision. Use round numbers — precision is not the point.
  • QPS example: 100M daily active users × 10 requests/day = 1B requests/day ÷ 86,400 seconds ≈ 12,000 QPS average. Assume 3–5× peak: ~50,000 QPS peak.
  • Storage example: If each user object is 1 KB and you have 100M users, that is 100 GB. If you store 1 year of events at 1 KB each and users generate 10 events/day: 100M × 10 × 365 × 1 KB ≈ 365 TB/year.
  • Bandwidth: 50,000 QPS × 1 KB average response = 50 MB/s outbound. If responses are 100 KB (e.g., image thumbnails): 5 GB/s.
These numbers tell you whether a single database can handle the read load (typically yes up to ~10K QPS with caching), whether you need sharding, and whether a CDN is non-negotiable.
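The arithmetic above is easy to script as a sanity check. All numbers below are illustrative round figures, not measurements:

```python
# Back-of-the-envelope estimates for a hypothetical service with 100M DAU.
SECONDS_PER_DAY = 86_400

dau = 100_000_000          # daily active users
requests_per_user = 10     # requests per user per day
peak_factor = 4            # assume 3-5x average load at peak

avg_qps = dau * requests_per_user / SECONDS_PER_DAY
peak_qps = avg_qps * peak_factor

# Storage: 10 events/user/day at 1 KB each, retained for a year
events_per_user_per_day = 10
event_size_kb = 1
yearly_storage_tb = dau * events_per_user_per_day * 365 * event_size_kb / 1e9

print(f"avg QPS:  {avg_qps:,.0f}")                    # ~11,574
print(f"peak QPS: {peak_qps:,.0f}")                   # ~46,296
print(f"storage:  {yearly_storage_tb:,.0f} TB/year")  # 365
```

Rounding 11,574 up to 12,000 and 46,296 up to ~50,000 is exactly the kind of approximation interviewers expect.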
3. Design the high-level architecture (15 minutes)

Sketch the major components and data flows. A typical backend system includes:
  • Clients (web, mobile, third-party)
  • DNS + CDN for static assets and geographic distribution
  • Load balancer distributing traffic across API servers
  • API servers (stateless; horizontally scalable)
  • Primary database (source of truth)
  • Cache layer (Redis, Memcached)
  • Message queue (Kafka, RabbitMQ) for async workloads
  • Background workers consuming from the queue
  • Object storage (S3) for blobs
Identify which components are stateless (easy to scale by adding instances) and which are stateful (require careful sharding, replication, or coordination). Explain your data model at a high level — the primary entities and their relationships.
4. Deep dive into critical components (10 minutes)

The interviewer will usually steer you toward one or two areas to explore further. Common deep dives:
  • Database schema and indexing: Which columns need indexes? How would you handle N+1 queries?
  • Caching strategy: What do you cache? How long? What is your cache invalidation strategy?
  • Sharding: How do you partition the data? By user ID hash? By geography? What are the hotspot risks?
  • Failure modes: What happens when a service goes down? When the database is unavailable? When the cache is cold?
  • Consistency model: Do you need strong consistency or is eventual consistency acceptable? Where are the tradeoffs?
Bring up tradeoffs proactively rather than waiting to be asked. Saying “I chose eventual consistency here to reduce write latency — the risk is users might briefly see stale data, which I judge acceptable for this use case” demonstrates senior-level thinking.

Scalability Patterns

Horizontal vs. Vertical Scaling

Vertical scaling means adding more CPU, RAM, or faster storage to a single machine. It is simple operationally — no code changes — but has a hard ceiling, costs superlinearly, and creates a single point of failure. Horizontal scaling means adding more machines. It requires stateless services (or explicit state management), a load balancer, and distributed data, but it scales to arbitrary load and improves fault tolerance. In practice: scale vertically until the cost or ceiling becomes a problem, then move to horizontal scaling for stateless tiers (API servers, workers) and vertical + read replicas for databases before introducing full horizontal sharding.

Load Balancing Strategies

| Algorithm | How It Works | Best For |
| --- | --- | --- |
| Round Robin | Requests distributed in sequence across servers | Homogeneous servers with similar request costs |
| Weighted Round Robin | Servers get traffic proportional to their weight | Mixed hardware capacity |
| Least Connections | New request goes to the server with fewest active connections | Variable request duration (long-lived connections) |
| IP Hash | Client IP determines which server handles requests | Session affinity (sticky sessions) without a session store |
| Random | Randomly select a server | Similar to round robin; statistically equivalent at scale |
Layer-7 load balancers (Nginx, HAProxy, AWS ALB) can route based on URL path, headers, or content type — allowing you to send /api/v1/upload to a high-memory server pool and /api/v1/query to a compute-optimized pool.
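A minimal sketch of such path-based routing in Nginx; the pool names and addresses are hypothetical:

```nginx
# Illustrative upstream pools -- names and addresses are made up.
upstream upload_pool {
    server 10.0.1.10:8080;   # high-memory instances
    server 10.0.1.11:8080;
}
upstream query_pool {
    least_conn;              # route to the least-busy server
    server 10.0.2.10:8080;   # compute-optimized instances
    server 10.0.2.11:8080;
}
server {
    listen 80;
    location /api/v1/upload {
        proxy_pass http://upload_pool;
    }
    location /api/v1/query {
        proxy_pass http://query_pool;
    }
}
```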

Caching Layers

Cache at multiple levels for maximum throughput:
  1. Client-side / browser cache: HTTP caching headers (Cache-Control, ETag) prevent unnecessary requests entirely.
  2. CDN: Geographic caching for static assets and cacheable API responses. Dramatically reduces origin load and latency for global users.
  3. Application cache (Redis/Memcached): Cache expensive database query results, aggregated data, or computed values. Typical TTL-based invalidation.
  4. Database query cache: MySQL’s built-in query cache (removed in MySQL 8.0) or result caching in the ORM layer.
Common write/read patterns:
  • Cache-aside (lazy loading): Application checks cache first; on miss, fetches from DB and populates cache. Simple but suffers cold-start lag after a cache flush.
  • Write-through: Write to cache and DB simultaneously. Cache is always warm but writes are slightly slower.
  • Write-behind (write-back): Write to cache only; flush to DB asynchronously. Fastest writes, highest risk of data loss.
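The cache-aside read path fits in a few lines. This sketch assumes a redis-py-style client (`get`, `set` with `ex=`) and a hypothetical `fetch_user_from_db` callable; both are stand-ins, not a specific codebase's API:

```python
import json

CACHE_TTL_SECONDS = 300  # TTL-based invalidation: entries expire automatically

def get_user(r, user_id, fetch_user_from_db):
    """Cache-aside: check cache, fall back to DB on miss, populate cache."""
    key = f"user:{user_id}"
    cached = r.get(key)                 # 1. check the cache first
    if cached is not None:
        return json.loads(cached)       # cache hit: no DB query
    user = fetch_user_from_db(user_id)  # 2. miss: read from source of truth
    r.set(key, json.dumps(user), ex=CACHE_TTL_SECONDS)  # 3. populate cache
    return user
```

Note the cold-start weakness: after a flush, every key's first read pays the full DB round trip.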

Storage Selection Guide

Choosing the right storage system for each use case is a core system design skill.
| System | Best For | Tradeoffs |
| --- | --- | --- |
| MySQL / PostgreSQL | Structured relational data requiring ACID transactions, complex JOINs, and strong consistency. User accounts, financial records, order management. | Not designed for horizontal write sharding; vertical scaling has limits. |
| Redis | Sub-millisecond read/write access, caching, rate limiting, session storage, leaderboards, pub/sub. Data fits in memory. | Memory is expensive; persistence is secondary to performance; limited query capability. |
| Elasticsearch | Full-text search, log aggregation, fuzzy matching, faceted filtering. | Eventually consistent, operationally complex, not a source of truth. Write throughput is lower than a relational DB. |
| Kafka | High-throughput event streaming, durable message queuing, event sourcing, real-time analytics pipelines. Consumers can replay from any offset. | Not a database; designed for append-only streams; operational overhead. |
| S3 / Object Storage | Blobs: images, videos, backups, large files. Unlimited scale, cheap storage per GB. | No transactions, no query capability, higher latency than in-memory systems. |
| Cassandra | High write throughput at massive scale, multi-region active-active, time-series data. | Eventual consistency, limited query patterns (no arbitrary JOINs), schema design is access-pattern-driven. |
A typical backend system uses MySQL as the source of truth, Redis for caching hot data, Kafka for async event processing, and S3 for blob storage — each doing what it is best at.

Common System Design Problems

URL Shortener

Core requirements: Given a long URL, generate a short code (e.g., flightaware.com/s/abc123). Redirect short codes to the original URL. Approach:
  • Generate a 6–8 character base62 code (a–z, A–Z, 0–9) for the short path. Use a counter + base62 encoding, or a hash of the long URL truncated to 6 characters (truncation invites collisions, so check for an existing entry and retry on conflict).
  • Store the mapping in MySQL: (short_code, long_url, created_at, user_id). Index on short_code for O(log n) lookup.
  • Cache hot short codes in Redis with a TTL. At 100K QPS, most codes are cold — cache the top 20% that account for 80% of traffic.
  • Redirect with HTTP 301 (permanent, client caches) or 302 (temporary, each click hits your server). 302 gives accurate analytics; 301 reduces server load.
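The counter + base62 step can be sketched as follows. The alphabet ordering here is arbitrary; any fixed 62-character ordering works as long as it never changes:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n: int) -> str:
    """Encode a non-negative counter value as a base62 short code."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))  # most significant digit first
```

Six characters give 62^6 ≈ 57 billion distinct codes, which is why 6–8 characters is the usual range.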

Rate Limiter

Core requirements: Limit each user to N requests per time window. Approach:
  • Token bucket algorithm: Each user has a bucket with capacity N. Tokens refill at a fixed rate. A request consumes one token; if the bucket is empty, reject the request. Allows short bursts up to capacity.
  • Fixed window counter: Count requests per user per window (e.g., per minute). Simple but allows 2× the rate at window boundaries (burst at end of one window + start of next).
  • Sliding window log: Store timestamps of recent requests in a sorted set. On each request, remove expired timestamps and check the count. Accurate but memory-intensive.
Implementation with Redis:
# Token bucket with Redis + Lua script (atomic); sketch of the full script
EVAL "
  local key = KEYS[1]
  local capacity = tonumber(ARGV[1])
  local refill_rate = tonumber(ARGV[2])
  local now = tonumber(ARGV[3])
  -- Refill based on time elapsed since the last request
  local bucket = redis.call('HMGET', key, 'tokens', 'ts')
  local tokens = tonumber(bucket[1]) or capacity
  local last = tonumber(bucket[2]) or now
  tokens = math.min(capacity, tokens + (now - last) * refill_rate)
  -- Consume one token if available; return 1 = allowed, 0 = rejected
  if tokens < 1 then
    redis.call('HSET', key, 'tokens', tokens, 'ts', now); return 0
  end
  redis.call('HSET', key, 'tokens', tokens - 1, 'ts', now)
  return 1
" 1 user:{user_id}:rate_limit {capacity} {refill_rate} {timestamp}
Distributed rate limiting requires all API servers to share state — Redis is the standard choice because INCR is atomic.
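The sliding window log variant can be sketched in plain Python; here a dict of timestamp lists stands in for the Redis sorted set (ZADD / ZREMRANGEBYSCORE / ZCARD in production):

```python
import bisect

class SlidingWindowLog:
    """Sliding-window-log rate limiter sketch (single-process stand-in)."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.logs: dict[str, list[float]] = {}  # user -> sorted timestamps

    def allow(self, user_id: str, now: float) -> bool:
        log = self.logs.setdefault(user_id, [])
        # Drop timestamps that fell out of the window
        cutoff = now - self.window
        del log[:bisect.bisect_right(log, cutoff)]
        if len(log) >= self.limit:
            return False          # window is full: reject
        log.append(now)           # record this request
        return True
```

The memory cost is visible here: one stored timestamp per allowed request, per user, for the full window.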

Notification System

Core requirements: Send notifications (push, email, SMS) to millions of users. Some are time-sensitive (payment alerts), some are marketing. Approach:
  • Decouple with a queue: API server publishes notification events to Kafka. Worker pools consume from Kafka and call third-party providers (APNs, FCM, SendGrid, Twilio).
  • Fanout service: For broadcast notifications (all users, or segments), a fanout service reads user segments and publishes one event per user. Use separate Kafka topics for push, email, and SMS to allow independent scaling.
  • Priority queues: Put time-sensitive notifications on a high-priority topic with a dedicated consumer pool. Marketing messages go to a lower-priority topic processed during off-peak hours.
  • Delivery tracking: Store notification status in MySQL (pending → sent → delivered → failed). Retry failed deliveries with exponential backoff.
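The retry-with-exponential-backoff step might look like the sketch below; `send` and `sleep` are injectable stand-ins for a real provider call and `time.sleep`:

```python
import random

def retry_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=None):
    """Call `send` until it succeeds, waiting 1s, 2s, 4s, ... between tries."""
    sleep = sleep or (lambda seconds: None)  # no-op by default, for testing
    for attempt in range(max_attempts):
        try:
            return send()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: mark the notification failed
            # Exponential delay plus jitter to avoid synchronized retries
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

In the queue-based design above, the worker would record the terminal failure in the MySQL status table rather than re-raise.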

News Feed / Timeline

Core requirements: Each user sees a reverse-chronological feed of posts from the people they follow.
  • Pull model (fan-out on read): When a user loads their feed, query all accounts they follow, fetch recent posts from each, merge and sort. Simple writes but expensive reads — N database queries per feed load.
  • Push model (fan-out on write): When a user creates a post, write it to every follower’s feed cache immediately. Feed reads are O(1) Redis reads. Expensive writes — a celebrity with 10M followers generates 10M write operations per post.
  • Hybrid model (used by most large systems): Push for average users; pull for celebrities. Cache the feed in Redis as a sorted set keyed by timestamp. On read, merge the precomputed feed with the latest posts from followed celebrities.
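The hybrid read path — merging a precomputed (pushed) feed with freshly pulled celebrity posts — reduces to a sorted merge. A toy sketch, with posts modeled as `(timestamp, post_id)` tuples:

```python
import heapq
import itertools

def read_feed(precomputed, celebrity_posts, limit=10):
    """Merge two newest-first post streams and return the top `limit` posts.

    `precomputed` is the user's pushed feed (from the Redis sorted set);
    `celebrity_posts` are pulled at read time. Both must be sorted
    newest-first for the lazy merge to be correct.
    """
    merged = heapq.merge(precomputed, celebrity_posts,
                         key=lambda post: post[0], reverse=True)
    return list(itertools.islice(merged, limit))
```

Because `heapq.merge` is lazy, the worker only materializes the first `limit` posts instead of sorting both streams in full.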

Key Metrics to Estimate

| Metric | Typical Formula | Example |
| --- | --- | --- |
| Peak QPS | DAU × requests/user/day ÷ seconds/day × peak factor | 10M × 20 ÷ 86,400 × 3 ≈ 7,000 QPS |
| Storage growth | Users × events/user/day × days/year × bytes/event | 10M × 100 × 365 × 500 B ≈ 180 TB/year |
| Cache size | Hot items × item size | 1M items × 1 KB = 1 GB |
| Bandwidth | Peak QPS × average response size | 7,000 × 50 KB = 350 MB/s |
| Read replicas needed | Peak read QPS ÷ single-DB read capacity | 50K ÷ 10K = 5 replicas |
These are order-of-magnitude estimates. Precise calculations require profiling your actual workload. The goal in an interview is to demonstrate you can anchor decisions in numbers rather than guessing.