Core concepts
Topic
A topic is a named, ordered, and durable log of messages. Every message a producer sends is appended to a topic. Topics are the primary abstraction for categorizing data — you might have separate topics for orders, payments, and user-events.
Partition
Each topic is divided into one or more partitions. A partition is an ordered, append-only sequence stored as a file on a broker’s disk. Partitioning enables two things:
- Parallelism: multiple consumers can read different partitions simultaneously, increasing aggregate throughput.
- Scalability: partitions can be distributed across multiple broker machines, so a topic’s capacity is not limited by a single node.
Within a consumer group, each partition is assigned to exactly one consumer at a time. If you have more consumers than partitions, the extra consumers sit idle. Always set your partition count at or above the maximum number of concurrent consumers you expect to run.
Broker
A broker is a Kafka server process. You run multiple brokers to form a cluster. Each broker stores a subset of partitions. For each partition, one broker is the leader (handles all reads and writes) and the remaining brokers that hold replicas are followers (replicate from the leader).
Consumer group
A consumer group is a named set of consumers that collectively consume a topic. Kafka automatically distributes partitions among the consumers in a group:
- Each partition is consumed by exactly one consumer in the group.
- Different groups consuming the same topic are fully independent — they each maintain their own offsets and do not interfere with each other.
Producer patterns and partition selection
A producer publishes messages to a topic. When writing, it must determine which partition to target. Kafka applies three rules in order (see the sketch after this list):
- Explicit partition: the producer specifies a partition number directly.
- Key-based: if the message carries a key, Kafka hashes the key and maps the result to a partition. All messages with the same key land in the same partition, preserving per-key ordering.
- Round-robin: if neither a partition nor a key is specified, Kafka distributes messages across partitions in a round-robin fashion.
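A minimal sketch of the three rules using the Java producer client; the topic name, keys, values, and broker address are all illustrative.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitionSelection {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // illustrative address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // 1. Explicit partition: this record always goes to partition 0.
            producer.send(new ProducerRecord<>("orders", 0, "order-1234", "created"));

            // 2. Key-based: hash("order-1234") selects the partition, so every
            //    message with this key lands in the same partition, in order.
            producer.send(new ProducerRecord<>("orders", "order-1234", "paid"));

            // 3. No key, no partition: the client spreads messages across
            //    partitions (round-robin, or sticky batching in newer clients).
            producer.send(new ProducerRecord<>("orders", "heartbeat"));
        }
    }
}
```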
Write flow
The producer sends each message to the leader broker for the target partition. The leader appends it to its local log, and the follower replicas fetch and copy it. The point at which the producer considers the write complete is controlled by its acks setting.
Delivery guarantees
Kafka offers three delivery semantics, chosen through producer and consumer configuration:
At-most-once
The producer sends the message and does not retry on failure. Messages may be lost but are never duplicated. Configuration: acks=0 (fire-and-forget). When to use: metrics or log aggregation where a small percentage of loss is acceptable and throughput is paramount.
At-least-once
The producer retries until the broker acknowledges, and the consumer commits offsets only after processing. Messages are never lost but may be delivered more than once. Configuration: acks=all with retries enabled and manual offset commits. When to use: most business workloads, combined with idempotent processing on the consumer side.
Exactly-once
Every message is processed exactly once, even in the presence of retries. Configuration: enable.idempotence=true plus Kafka transactions (a transactional.id) on the producer. When to use: pipelines such as billing where neither loss nor duplication is tolerable.
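As a sketch, the producer settings that correspond to each guarantee; the property keys are the standard Java client names, while the bootstrap address and transactional id are illustrative. Consumer-side offset handling is covered in the next section.

```java
import java.util.Properties;

public class DeliveryConfigs {
    // At-most-once: fire-and-forget, no retries.
    static Properties atMostOnce() {
        Properties p = base();
        p.put("acks", "0");
        p.put("retries", "0");
        return p;
    }

    // At-least-once: wait for all in-sync replicas and retry on failure.
    static Properties atLeastOnce() {
        Properties p = base();
        p.put("acks", "all");
        p.put("retries", Integer.toString(Integer.MAX_VALUE));
        return p;
    }

    // Exactly-once: idempotent producer plus transactions.
    static Properties exactlyOnce() {
        Properties p = atLeastOnce();
        p.put("enable.idempotence", "true");
        p.put("transactional.id", "payments-tx-1");   // illustrative id
        return p;
    }

    static Properties base() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092"); // illustrative address
        p.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        return p;
    }
}
```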
Consumer groups and offset management
Offsets
Every consumer in a group tracks its read position in each partition using an offset. After processing a batch of messages, the consumer commits its offset back to Kafka (stored in the internal __consumer_offsets topic). On restart or rebalance, the consumer resumes from its last committed offset.
Two commit strategies:
| Strategy | How it works | Risk |
|---|---|---|
| Auto-commit | Kafka periodically commits the offset automatically (enable.auto.commit=true) | Message loss: auto-commit may advance the offset before processing completes; if the consumer crashes, that message is skipped |
| Manual commit | Your code calls commitSync() or commitAsync() after processing | Duplicate delivery: if the consumer crashes after processing but before committing, it reprocesses on restart |
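A minimal at-least-once loop built on manual commits: process first, commit second. Topic, group id, and broker address are illustrative.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // illustrative
        props.put("group.id", "order-processors");          // illustrative
        props.put("enable.auto.commit", "false");           // take control of commits
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);        // business logic first...
                }
                consumer.commitSync();      // ...then commit; a crash before this
                                            // line means reprocessing, not loss
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.printf("offset=%d key=%s%n", record.offset(), record.key());
    }
}
```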
Rebalancing
When consumers join or leave a group, or when new partitions are added, Kafka performs a rebalance to redistribute partitions among the active consumers. During a rebalance, all consumers in the group pause consumption. Minimize rebalance frequency by (configuration sketched after this list):
- Setting session.timeout.ms and heartbeat.interval.ms appropriately — a consumer that misses its heartbeat is considered dead and triggers a rebalance.
- Using static group membership (group.instance.id) so that a restarting consumer re-claims its previous partitions without triggering a full rebalance.
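The corresponding consumer properties, shown as a fragment to merge into the consumer configuration above; the values and instance id are illustrative and should be tuned to your environment.

```java
// Fragment: add to the consumer Properties from the earlier sketch.
// A consumer is declared dead if no heartbeat arrives within the session
// timeout; the heartbeat interval is conventionally about a third of it.
props.put("session.timeout.ms", "30000");
props.put("heartbeat.interval.ms", "10000");
// Static membership: a restarted consumer with the same instance id
// re-claims its old partitions instead of forcing a full rebalance.
props.put("group.instance.id", "order-processor-1");   // illustrative id
```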
Message ordering guarantees
Kafka guarantees ordering within a partition only. If your application requires all events for a given entity (e.g., all events for order_id=1234) to be processed in order, use a message key equal to the entity ID. Kafka’s key-based routing ensures all messages for that key land in the same partition and are consumed in order.
How to avoid message loss
Address reliability at each of the three stages:
Production
- Use acks=all to require all in-sync replicas to acknowledge before the producer considers a message written.
- Set retries to a high value and enable.idempotence=true to deduplicate retries.
- Wrap send calls in try-catch; alert on repeated failures.
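One way to implement the last bullet, as a sketch; the alert call is a hypothetical placeholder for your paging or metrics integration.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class ReliableSend {
    static void send(KafkaProducer<String, String> producer,
                     String topic, String key, String value) {
        producer.send(new ProducerRecord<>(topic, key, value),
                (RecordMetadata metadata, Exception exception) -> {
                    if (exception != null) {
                        // The callback fires after the client's own retries are
                        // exhausted, so this is a genuine failure worth alerting on.
                        System.err.println("send failed for key " + key + ": " + exception);
                        // alert("kafka-send-failure");  // hypothetical alerting hook
                    }
                });
    }
}
```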
Storage
- Set min.insync.replicas to at least 2 so that data is on multiple brokers before the leader acknowledges.
- Use replication.factor=3 for critical topics (see the topic-creation sketch after this list).
- Deploy brokers across availability zones.
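A sketch that bakes these settings in at topic-creation time with the Java AdminClient; the topic name, partition count, and broker address are illustrative.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateDurableTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // illustrative
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("payments", 6, (short) 3)  // 6 partitions, RF 3
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(List.of(topic)).all().get();  // block until created
        }
    }
}
```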
Consumption
- Commit offsets only after your business logic completes successfully.
- Implement idempotent consumers using a unique message ID stored in a database to detect and skip redeliveries.
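A minimal sketch of the idempotency check. It assumes the message key is the unique id and uses an in-memory set as a stand-in for the database table the bullet describes.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class IdempotentHandler {
    // Stand-in for a durable store (e.g. a DB table with a unique constraint
    // on message_id); an in-memory set does not survive restarts.
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();

    void handle(ConsumerRecord<String, String> record) {
        String messageId = record.key();   // assumes the key is a unique message id
        if (!processedIds.add(messageId)) {
            return;                        // redelivery: already processed, skip
        }
        applyBusinessLogic(record.value());
        // Commit the offset only after this method returns successfully.
    }

    void applyBusinessLogic(String payload) { /* ... */ }
}
```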
Kafka vs. RabbitMQ
| Dimension | Kafka | RabbitMQ |
|---|---|---|
| Model | Publish/subscribe log | Queue model (point-to-point) |
| Message retention | Configurable retention period; messages persist after consumption | Messages deleted after acknowledgement |
| Replay | Consumers can seek to any offset and re-read old messages | Not possible after acknowledgement |
| Throughput | Very high (millions of messages/second per cluster) | High but lower than Kafka at scale |
| Ordering | Guaranteed within a partition | Guaranteed within a single queue |
| Consumer model | Pull (consumers request messages from brokers) | Push (broker delivers to consumers) |
| Use case fit | Event streaming, CDC, log aggregation, large-scale pipelines | Task queues, RPC patterns, routing via exchange |
| Operational complexity | Higher (ZooKeeper or KRaft, partition management) | Lower (simpler single-node setup) |
Common patterns
Event sourcing
Instead of storing only the current state of a record, you store an ordered log of every event that changed that record. Kafka serves as the event log. Any downstream service can rebuild the current state by replaying the partition from offset 0. Example: an account-events topic receives AccountOpened, FundsDeposited, FundsWithdrawn events. A consumer replays the log to compute the current balance.
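A simplified replay sketch that rebuilds a balance from offset 0; the event encoding ("FundsDeposited:100"), names, and address are illustrative, and the catch-up check is naive.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class BalanceRebuilder {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // illustrative
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        long balance = 0;
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign all partitions directly (no consumer group) and rewind
            // to offset 0 so the whole event history is replayed.
            List<TopicPartition> parts = consumer.partitionsFor("account-events").stream()
                    .map(i -> new TopicPartition(i.topic(), i.partition()))
                    .collect(Collectors.toList());
            consumer.assign(parts);
            consumer.seekToBeginning(parts);

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) break;   // naive "caught up" check for a sketch
                for (ConsumerRecord<String, String> r : records) {
                    String[] e = r.value().split(":");   // e.g. "FundsDeposited:100"
                    if (e[0].equals("FundsDeposited")) balance += Long.parseLong(e[1]);
                    if (e[0].equals("FundsWithdrawn")) balance -= Long.parseLong(e[1]);
                }
            }
        }
        System.out.println("current balance = " + balance);
    }
}
```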
Change-data capture (CDC)
Tools like Debezium or Canal connect to a database’s replication log (MySQL binlog, PostgreSQL WAL) and publish every row-level change as a Kafka message. Downstream services consume these change events to keep caches, search indexes, or analytics systems in sync without modifying application code.
Stream processing
Frameworks like Apache Flink, Kafka Streams, and Apache Spark Structured Streaming read from Kafka topics, apply transformations (filtering, aggregation, joining), and write results to output topics or external sinks. This lets you compute real-time metrics, detect anomalies, or enrich events with reference data as they flow through the system.
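A small Kafka Streams sketch of the filter-and-forward shape described above; the topic names, application id, and amount threshold are illustrative.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class LargeOrderFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "large-order-filter"); // illustrative
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // illustrative
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");
        orders
                .filter((orderId, amount) -> Long.parseLong(amount) > 10_000) // keep big orders
                .to("large-orders");                                          // output topic

        new KafkaStreams(builder.build(), props).start();
    }
}
```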
Handling message backlog
If consumers fall behind producers, the backlog grows. Resolution strategies:
- Find the root cause: check for processing bugs causing retries or unusually slow business logic.
- Optimize the consumer: reduce per-message processing time by batching database writes or parallelizing work within a single consumer.
- Scale horizontally: increase the partition count on the topic and add more consumer instances to the group. Partition count must increase first — a consumer with no partition assigned contributes nothing.
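A sketch of the partition increase with the Java AdminClient; the topic name, target count, and address are illustrative. Note that adding partitions changes the key-to-partition mapping for new messages, so per-key ordering is only guaranteed going forward.

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;

public class ScaleOut {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // illustrative
        try (AdminClient admin = AdminClient.create(props)) {
            // Raise "orders" from its current count to 12 partitions; then add
            // consumers to the group, up to one per partition.
            admin.createPartitions(Map.of("orders", NewPartitions.increaseTo(12)))
                 .all().get();
        }
    }
}
```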