The Design Framework
Follow these four steps in every system design interview. Spend your time roughly as shown — interviewers expect you to reach a working design, not spend fifteen minutes on requirements.

Clarify requirements (5 minutes)
Never assume you understand the requirements. Ask about:
- Functional requirements: What does the system actually do? What features are in scope for this session?
- Non-functional requirements: What are the latency targets (p99 < 100 ms?), availability expectations (99.9%? 99.99%?), and consistency requirements (eventual vs. strong)?
- Scale: How many users? Daily active users vs. total users? Read-heavy or write-heavy? What are the peak QPS expectations?
- Constraints: Any existing infrastructure to integrate with? Budget or technology constraints?
Estimate scale (5 minutes)
Back-of-the-envelope numbers guide every subsequent decision. Use round numbers — precision is not the point.

QPS example: 100M daily active users × 10 requests/day = 1B requests/day ÷ 86,400 seconds ≈ 12,000 QPS average. Assume 3–5× peak: ~50,000 QPS peak.

Storage example: If each user object is 1 KB and you have 100M users, that is 100 GB. If you store 1 year of events at 1 KB each and users generate 10 events/day: 100M × 10 × 365 × 1 KB ≈ 365 TB/year.

Bandwidth: 50,000 QPS × 1 KB average response = 50 MB/s outbound. If responses are 100 KB (e.g., image thumbnails): 5 GB/s.

These numbers tell you whether a single database can handle the read load (typically yes up to ~10K QPS with caching), whether you need sharding, and whether a CDN is non-negotiable.
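The arithmetic above is easy to script. A minimal sketch of the two core estimates — peak QPS and yearly storage — where the 4× peak factor and the 1 TB = 10^12 bytes convention are assumptions, not fixed rules:

```python
SECONDS_PER_DAY = 86_400

def peak_qps(dau: int, requests_per_user: int, peak_factor: float = 4.0) -> float:
    """Average QPS scaled by a peak factor (3-5x is a common assumption)."""
    return dau * requests_per_user / SECONDS_PER_DAY * peak_factor

def yearly_storage_tb(users: int, events_per_day: int, bytes_per_event: int) -> float:
    """One year of event storage in terabytes (1 TB taken as 10^12 bytes)."""
    return users * events_per_day * 365 * bytes_per_event / 1e12

# 100M DAU x 10 requests/day: ~11.6K QPS average, ~46K QPS at a 4x peak
peak = peak_qps(100_000_000, 10)

# 100M users x 10 events/day x 1 KB each: ~365 TB/year
storage = yearly_storage_tb(100_000_000, 10, 1_000)
```

In an interview you do this on a whiteboard, but the structure is the same: a handful of inputs, multiplied through, rounded aggressively.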
Design the high-level architecture (15 minutes)
Sketch the major components and data flows. A typical backend system includes:
- Clients (web, mobile, third-party)
- DNS + CDN for static assets and geographic distribution
- Load balancer distributing traffic across API servers
- API servers (stateless; horizontally scalable)
- Primary database (source of truth)
- Cache layer (Redis, Memcached)
- Message queue (Kafka, RabbitMQ) for async workloads
- Background workers consuming from the queue
- Object storage (S3) for blobs
Deep dive into critical components (10 minutes)
The interviewer will usually steer you toward one or two areas to explore further. Common deep dives:
- Database schema and indexing: Which columns need indexes? How would you handle N+1 queries?
- Caching strategy: What do you cache? How long? What is your cache invalidation strategy?
- Sharding: How do you partition the data? By user ID hash? By geography? What are the hotspot risks?
- Failure modes: What happens when a service goes down? When the database is unavailable? When the cache is cold?
- Consistency model: Do you need strong consistency or is eventual consistency acceptable? Where are the tradeoffs?
Scalability Patterns
Horizontal vs. Vertical Scaling
Vertical scaling means adding more CPU, RAM, or faster storage to a single machine. It is operationally simple — no code changes — but has a hard ceiling, costs superlinearly, and creates a single point of failure. Horizontal scaling means adding more machines. It requires stateless services (or explicit state management), a load balancer, and distributed data, but it scales to arbitrary load and improves fault tolerance. In practice: scale vertically until the cost or ceiling becomes a problem, then move to horizontal scaling for stateless tiers (API servers, workers) and vertical + read replicas for databases before introducing full horizontal sharding.

Load Balancing Strategies
| Algorithm | How It Works | Best For |
|---|---|---|
| Round Robin | Requests distributed in sequence across servers | Homogeneous servers with similar request costs |
| Weighted Round Robin | Servers get traffic proportional to their weight | Mixed hardware capacity |
| Least Connections | New request goes to the server with fewest active connections | Variable request duration (long-lived connections) |
| IP Hash | Client IP determines which server handles requests | Session affinity (sticky sessions) without a session store |
| Random | Randomly select a server | Similar to round robin, statistically equivalent at scale |
Layer 7 (application-aware) load balancers can also route by request path — for example, `/api/v1/upload` to a high-memory server pool and `/api/v1/query` to a compute-optimized pool.
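The least-connections strategy from the table can be sketched as a small in-process balancer. This is illustrative only (server names are made up, and real balancers track connections on the data path rather than via explicit calls):

```python
import random

class LeastConnectionsBalancer:
    """Route each new request to the server with the fewest active connections."""

    def __init__(self, servers):
        self.active = {s: 0 for s in servers}   # server -> active connection count

    def acquire(self) -> str:
        # Pick among the servers tied for fewest connections; break ties
        # randomly so one host is not hammered when counts are equal.
        fewest = min(self.active.values())
        candidates = [s for s, n in self.active.items() if n == fewest]
        server = random.choice(candidates)
        self.active[server] += 1
        return server

    def release(self, server: str) -> None:
        self.active[server] -= 1

lb = LeastConnectionsBalancer(["app-1", "app-2", "app-3"])
conn = lb.acquire()   # all idle: any server may be chosen
lb.release(conn)      # caller reports completion so counts stay accurate
```

Note why this fits long-lived connections: round robin would keep assigning work to a server still holding slow requests, while least-connections naturally steers around it.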
Caching Layers
Cache at multiple levels for maximum throughput:
- Client-side / browser cache: HTTP caching headers (`Cache-Control`, `ETag`) prevent unnecessary requests entirely.
- CDN: Geographic caching for static assets and cacheable API responses. Dramatically reduces origin load and latency for global users.
- Application cache (Redis/Memcached): Cache expensive database query results, aggregated data, or computed values. Typical TTL-based invalidation.
- Database query cache: MySQL’s built-in query cache (deprecated in MySQL 8.0) or result caching in the ORM layer.
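The application-cache bullet is usually implemented as cache-aside: read the cache first, fall back to the database on a miss, then populate the cache with a TTL. A minimal sketch, where a plain dict stands in for Redis and the `db_query` callable and 60-second TTL are assumptions:

```python
import time

_cache: dict[str, tuple[float, object]] = {}   # key -> (expires_at, value)
TTL_SECONDS = 60

def get_user(user_id: str, db_query) -> object:
    """Cache-aside read: serve from cache if fresh, else query and repopulate."""
    entry = _cache.get(user_id)
    if entry and entry[0] > time.monotonic():
        return entry[1]                          # cache hit, still within TTL
    value = db_query(user_id)                    # cache miss: hit the database
    _cache[user_id] = (time.monotonic() + TTL_SECONDS, value)
    return value
```

On writes, the usual companion move is to delete the key (invalidation) rather than update it in place, which avoids racing a concurrent read that repopulates stale data.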
Storage Selection Guide
Choosing the right storage system for each use case is a core system design skill.

| System | Best For | Tradeoffs |
|---|---|---|
| MySQL / PostgreSQL | Structured relational data requiring ACID transactions, complex JOINs, and strong consistency. User accounts, financial records, order management. | Not designed for horizontal write sharding; vertical scaling has limits. |
| Redis | Sub-millisecond read/write access, caching, rate limiting, session storage, leaderboards, pub/sub. Data fits in memory. | Memory is expensive; persistence is secondary to performance; limited query capability. |
| Elasticsearch | Full-text search, log aggregation, fuzzy matching, faceted filtering. | Eventually consistent, operationally complex, not a source of truth. Write throughput is lower than a relational DB. |
| Kafka | High-throughput event streaming, durable message queuing, event sourcing, real-time analytics pipelines. Consumers can replay from any offset. | Not a database; designed for append-only streams; operational overhead. |
| S3 / Object Storage | Blobs: images, videos, backups, large files. Unlimited scale, cheap storage per GB. | No transactions, no query capability, higher latency than in-memory systems. |
| Cassandra | High write throughput at massive scale, multi-region active-active, time-series data. | Eventual consistency, limited query patterns (no arbitrary JOINs), schema design is access-pattern-driven. |
Common System Design Problems
URL Shortener
Core requirements: Given a long URL, generate a short code (e.g., flightaware.com/s/abc123). Redirect short codes to the original URL.
Approach:
- Generate a 6–8 character base62 code (a–z, A–Z, 0–9) for the short path. Use a counter + base62 encoding or a hash of the long URL truncated to 6 characters.
- Store the mapping in MySQL: `(short_code, long_url, created_at, user_id)`. Index on `short_code` for O(log n) lookup.
- Cache hot short codes in Redis with a TTL. At 100K QPS, most codes are cold — cache the top 20% that account for 80% of traffic.
- Redirect with HTTP 301 (permanent, client caches) or 302 (temporary, each click hits your server). 302 gives accurate analytics; 301 reduces server load.
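The counter + base62 approach from the first bullet can be sketched directly. With 62 characters, a 6-character code covers 62^6 ≈ 56.8 billion IDs; the alphabet ordering here is one arbitrary choice among several:

```python
import string

# digits, then lowercase, then uppercase: 62 symbols total
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def encode_base62(n: int) -> str:
    """Encode a non-negative counter value as a base62 short code."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n > 0:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

def decode_base62(code: str) -> int:
    """Invert encode_base62 to recover the counter value."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

A monotonically increasing counter makes codes guessable in sequence; if that matters, a common variant is to encode a random or bit-shuffled ID instead.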
Rate Limiter
Core requirements: Limit each user to N requests per time window.

Approach:
- Token bucket algorithm: Each user has a bucket with capacity N. Tokens refill at a fixed rate. A request consumes one token; if the bucket is empty, reject the request. Allows short bursts up to capacity.
- Fixed window counter: Count requests per user per window (e.g., per minute). Simple but allows 2× the rate at window boundaries (burst at end of one window + start of next).
- Sliding window log: Store timestamps of recent requests in a sorted set. On each request, remove expired timestamps and check the count. Accurate but memory-intensive.
- Implementation note: Redis is the usual backing store for these counters and sorted sets; `INCR` is atomic, so concurrent requests arriving at different API servers are counted correctly against a shared limit.
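The token bucket algorithm described above can be sketched as a small class. This is a single-process illustration; a real deployment would keep the per-user state in Redis so all API servers share it:

```python
import time

class TokenBucket:
    """Token bucket rate limiter: capacity N, tokens refill at a fixed rate."""

    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)        # start full: bursts allowed up front
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Refilling lazily on each request (rather than on a timer) is what makes this cheap: the bucket is just two numbers per user.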
Notification System
Core requirements: Send notifications (push, email, SMS) to millions of users. Some are time-sensitive (payment alerts), some are marketing.

Approach:
- Decouple with a queue: API server publishes notification events to Kafka. Worker pools consume from Kafka and call third-party providers (APNs, FCM, SendGrid, Twilio).
- Fanout service: For broadcast notifications (all users, or segments), a fanout service reads user segments and publishes one event per user. Use separate Kafka topics for push, email, and SMS to allow independent scaling.
- Priority queues: Put time-sensitive notifications on a high-priority topic with a dedicated consumer pool. Marketing messages go to a lower-priority topic processed during off-peak hours.
- Delivery tracking: Store notification status in MySQL (pending → sent → delivered → failed). Retry failed deliveries with exponential backoff.
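The retry-with-exponential-backoff step from the delivery-tracking bullet looks like this in a worker. A minimal sketch, where `send_fn` (the provider call), the attempt limit, and the delay constants are all illustrative assumptions:

```python
import random
import time

def deliver_with_retry(send_fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Call send_fn, retrying on failure with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return send_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                        # out of attempts: surface the failure
            # 1s, 2s, 4s, ... plus jitter so retries from many workers spread out
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

In practice the final failure would move the message to a dead-letter topic and flip its status row to `failed` rather than raise, but the backoff shape is the same.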
News Feed / Timeline
Core requirements: Each user sees a reverse-chronological feed of posts from the people they follow.

Pull model (fan-out on read): When a user loads their feed, query all accounts they follow, fetch recent posts from each, merge and sort. Simple writes but expensive reads — N database queries per feed load.

Push model (fan-out on write): When a user creates a post, write it to every follower’s feed cache immediately. Feed reads are O(1) Redis reads. Expensive writes — a celebrity with 10M followers generates 10M write operations per post.

Hybrid model (used by most large systems): Push for average users; pull for celebrities. Cache the feed in Redis as a sorted set keyed by timestamp. On read, merge the precomputed feed with the latest posts from followed celebrities.

Key Metrics to Estimate
| Metric | Typical Formula | Example |
|---|---|---|
| Peak QPS | DAU × requests/user/day ÷ seconds/day × peak factor | 10M × 20 ÷ 86,400 × 3 ≈ 7,000 QPS |
| Storage growth | Users × events/user/day × days/year × bytes/event | 10M × 100 × 365 × 500 B ≈ 180 TB/year |
| Cache size | hot items × item size | 1M items × 1 KB = 1 GB |
| Bandwidth | Peak QPS × average response size | 7,000 × 50 KB = 350 MB/s |
| Read replicas needed | Peak read QPS ÷ single DB read capacity | 50K ÷ 10K = 5 replicas |
These are order-of-magnitude estimates. Precise calculations require profiling your actual workload. The goal in an interview is to demonstrate you can anchor decisions in numbers rather than guessing.