System Design Question: Design WhatsApp

#108: System Design Interview - Part 2

Dec 16, 2025

∙ Paid

Share this post & I’ll send you some rewards for the referrals.
Block diagrams created using Eraser.

Designing a real-time chat app like WhatsApp is a popular system design interview question.

In a short period, you need to show you understand not just the happy path, but also all the things that can go wrong when messages fly between disconnected users on mobile networks.

In Part 1, we covered:

Core architecture and data models
Service discovery and presence management
Handling online and offline message delivery
And many more!

Today we’ll cover:

Scaling from one server to thousands with Redis Pub/Sub
Database sharding and replication strategies
Message storage service internals
Scheduled jobs and cleanup
Message deduplication, failure recovery, and edge cases
End-to-end encryption1 and key management
Rate limiting and abuse prevention
Analytics and monitoring infrastructure
Multi-region deployment and global distribution
Schema versioning and backward compatibility

By the end, you’ll understand not just what to build, but why each decision is made and what breaks when you get it wrong.

Onward.

Software engineers, are you prepared if layoffs happen again? (Sponsor)

Build a GitHub portfolio with 4 production-grade projects.

Build real AI systems: autonomous agents, recommendation engines, and multimodal pipelines.
Master the modern stack: LangChain, TensorFlow, PyTorch, multi-agent architectures, and RAG systems.

Flexible & Practical:

5-month, fully online, project-based program for working engineers
Live hands-on workshops + 24/7 support

Earn up to 56% more by mastering production-grade AI.

Next cohort starts on 19 January 2026.

Imagine leading AI initiatives with confidence:

CLICK HERE TO KNOW MORE

I want to reintroduce Hayk Simonyan as a guest author.

He’s a senior software engineer specializing in helping developers break through their career plateaus and secure senior roles.

If you want to master the essential system design skills and land senior developer roles, I highly recommend checking out Hayk’s YouTube channel.

His approach focuses on what top employers actually care about: system design expertise, advanced project experience, and elite-level interview performance.

Scaling

Our single chat server design works great for 1000 users. But it’s terrible for a billion.

So let’s scale horizontally...

Problem: Users on Different Servers

User A connects to Server 1. User B connects to Server 2.

So how does a message from User A reach B?

Option 1: Load balancer does nothing

This approach doesn’t work. Server 1 receives the message but has no connection to User B.

Option 2: Kafka topics per user

Create a Kafka topic (a category or feed name to which messages are published) for each user. Servers subscribe to topics for their connected users.

Why this approach won’t work?

Kafka needs about 50KB of overhead per topic.

1 billion users = 50 TB just for topic metadata.

Kafka wasn’t built for billions of topics. It’s designed for thousands of high-throughput topics, not billions of low-throughput ones.

Option 3: Consistent hashing

Hash user IDs to specific servers2.

Always route the same user to the same server. When a message arrives for User B, forward it directly to Server 2.

Pros:

Direct server-to-server communication,
No extra infrastructure.

Cons:

All-to-all connections between servers (if you have 1000 servers, each needs connections to 999 others).
Adding new servers requires careful coordination to avoid dropping messages during the transition.
Servers need to be big to keep the fleet small.

Option 4: Redis Pub/Sub (what we actually use)

This is the cleanest solution at our scale.

How it works:

Redis Pub/Sub3 is a lightweight messaging system built into Redis. It lets servers subscribe to channels and publish messages to those channels. Think of it like a radio station (channel) where servers can tune in to listen and broadcast messages.

When User B connects to Chat Server 2, Server 2 subscribes to Redis channel of user:B.
When Server 1 needs to deliver a message to User B, Server 1 publishes the message to Redis channel user:B.
Redis routes it to Server 2. Then Server 2 delivers to User B via WebSocket4.

Why we chose this:

Redis Pub/Sub uses only a few bytes per channel (versus Kafka’s 50KB per topic).

It’s “at most once” delivery, meaning if no one’s listening when you publish, the message is lost. But that’s fine for our use case. We already have durability through the Message Queue5 and database. Pub/Sub is just for real-time delivery to online users. If they’re offline, they still get messages from their inbox.

The beauty is that chat servers don’t need to know about each other. They talk only to Redis. Add or remove chat servers at will. No coordination needed.

Tradeoffs:

Redis Pub/Sub adds a few milliseconds of latency (single-digit, usually under 5ms) because messages bounce through Redis.

But this is negligible compared to network latency.

We also need to run a Redis cluster for high availability, which adds operational complexity. Yet the simplicity of not managing server-to-server connections makes it worth it.

Message Storage Service: The Detailed Flow

The Message Storage Service sits between Message Queue and Database.

Let’s break down exactly what it does…

Architecture:

→ Message Queue (Kafka)
→ Message Storage Service instances
→ Database Cluster

Each instance:

Consumes from assigned Kafka partitions
Batches writes for efficiency
Handles retries and failures
Updates indexes

Detailed Processing Flow:

Step 1: Consuming from Queue

Storage Service Instance:

Subscribes to Kafka partitions 0-9 (out of 100 total)
Polls for new messages every 100ms
Receives a batch of 500 messages
Processes batch in parallel (thread pool of 20 workers)

Step 2: Message Validation and Enrichment

For each message:

Validate schema (required fields present)
Check message size limits (text < 4KB, no huge strings)
Enrich with metadata:
- Calculate message hash (for deduplication)
- Extract mentions (@username patterns)
- Identify links for preview generation
- Classify content type
Generate indexes needed for queries

Step 3: Batch Writing

Instead of writing messages one-by-one:

Accumulate messages in the memory buffer
When buffer hits 1000 messages OR 5 seconds elapsed:
- Sort messages by target shard
- Group messages by conversation
- Prepare batch INSERT statements

3. Write the entire batch in one operation per shard

Why batching matters:

A single database INSERT takes ~5ms.
1000 individual INSERTs = 5 seconds.
One batched INSERT of 1000 messages = 50ms.

That’s a 100x throughput improvement.

Step 4: Writing to Database (Cassandra)

For each shard:

Cassandra processes this batch atomically. All succeed, or all fail.

Step 5: Updating Indexes

Simultaneously with message write:

Update conversation index:
- Latest message timestamp
- Unread count
- Last message preview
Update search index (if full-text search enabled):
- Extract keywords
- Send to Elasticsearch cluster
Update analytics:
- Increment counters (messages/day, active users)
- Send to analytics pipeline (separate Kafka topic)

Step 6: Handling Failures

If write fails:

Log error with message details
Check error type:
- Temporary (connection timeout): Retry with exponential backoff6
- Permanent (validation error): Move to dead-letter7 queue for investigation
- Duplicate key: Skip (already written, this is retry)
Don’t block other messages
Update monitoring metrics

Step 7: Committing Offset

After a successful write:

Commit Kafka offset (mark messages as processed)
This ensures we don’t reprocess on restart
If service crashes before commit:
- On restart, reprocess from last committed offset
- Duplicates handled by checking message_id

Scaling the Storage Service:

Each instance is stateless. To handle more throughput:

Add more instances (they auto-distribute Kafka partitions)
Each instance processes independently
No coordination needed between instances
Kafka handles partition assignment

If you have 10 instances and add an 11th, Kafka automatically rebalances. Some instances give up partitions to the new instance. Within seconds, the load is redistributed.

Monitoring Critical Metrics:

Consumer lag: How many messages behind are we?
- (Alert if lag > 10,000 messages)
Write latency: How long does each batch take?
- (Alert if p99 > 500ms)
Error rate: How many failures?
- (Alert if > 0.1%)
Throughput: Messages per second processed
- (Scale up if maxed out)

Tradeoffs:

Message Storage Service adds latency (messages sit in the queue before being written).

But it decouples chat servers from database performance. Chat servers can acknowledge messages immediately without waiting for database writes.

This keeps user-facing latency low even if the database is temporarily slow.

End-to-End Encryption and Key Management

WhatsApp’s defining feature is end-to-end encryption.

Messages are encrypted on the sender’s device and only decrypted on the recipient’s device. Even WhatsApp’s servers can’t read the content.

The challenge is how to encrypt messages when users might have multiple devices, be offline, or need to verify identities?

Signal Protocol (What WhatsApp Uses):

WhatsApp implements the Signal Protocol, which provides:

End-to-end encryption
Forward secrecy8 (compromising one message doesn’t compromise others)
Future secrecy (compromising a key doesn’t compromise future messages)

Key Exchange Flow:

When User A first messages User B:

Identity Keys (long-term):
- Each user generates an identity key pair on registration
- Public key gets uploaded to server
- Private key never leaves device
Prekeys (medium-term):
- User B generates 100 signed prekeys
- Uploads to server
- Server stores them for future initiators
One-time Prekeys:
- User B generates additional one-time prekeys
- Used once and discarded
- Provides forward secrecy
Session Establishment:
- User A fetches User B’s public identity key and a prekey from server
- User A generates ephemeral key pair
- Performs Diffie-Hellman key exchange
- Derives shared secret (session key)
- Encrypts message with session key
- Sends encrypted message + A’s ephemeral public key
User B receives:
- Uses their private keys + A’s ephemeral public key
- Derives the same session key
- Decrypts message
- Establishes bidirectional encrypted session

Message Encryption:

Each message is encrypted with:

Encrypted Message = AES-256(message, session_key + message_counter)

MAC = HMAC-SHA256(encrypted_message, mac_key)

The message_counter prevents replay attacks. Each message increments the counter.

Ratcheting (Forward Secrecy):

After each message, keys are “ratcheted” forward:

New ephemeral key pair generated
New session key derived
Previous keys discarded

This means compromising today’s key doesn’t reveal:

Yesterday’s messages,
Or tomorrow’s messages.

Multi-Device Support:

User A has a phone and a laptop.

How does encryption work?

Approach 1: Sender Keys (Groups):

For groups, sending individual encrypted copies to each device is expensive.

Instead:

User A generates a sender key for the group
Encrypts sender key for each member using their session key
Distributes sender key
All future messages get encrypted with sender key
Much more efficient for large groups

Approach 2: Session Per Device (1:1):

For 1:1 chats.

Each device has its own identity
User A’s phone and laptop are separate entities
When messaging User B, send two encrypted copies (one for each of B’s devices)