

Discover more from System Design Newsletter
8 Reasons Why WhatsApp Was Able to Support 50 Billion Messages a Day With Only 32 Engineers
#1: Learn More - Awesome WhatsApp Engineering (6 minutes)
Get the powerful template to approach system design for FREE on newsletter sign-up:
This is a post outlining the incredible story of WhatsApp co-founder Jan Koum and the engineering techniques used to scale WhatsApp. Consider sharing this post with somebody who wants to study scalability patterns.
January 2008 - California, United States.
Jan Koum, an engineer at Yahoo, applies for work at Facebook - rejected.
This was not the end - he moved on with his life.
He buys an iPhone in the next year and immediately recognizes the huge potential of the new App Store.
He decides to build an instant messenger with some of his former coworkers from Yahoo. They named it WhatsApp. The vision behind WhatsApp was to replace the expensive SMS.
With 1 million people signing up each day, the growth rate of WhatsApp was mind-boggling.
WhatsApp was able to support 50 billion messages a day from 450 million daily active users with only 32 engineers.
Although explosive product growth is a good problem to have, Jan Koum and the team behind WhatsApp had to adopt the best engineering practices to overcome the challenges.
WhatsApp Engineering
The WhatsApp engineering practices to meet extreme scalability is summarized as follows:
1. Single Responsibility Principle
The product focus was always only on the core feature - messaging.
They didn’t bother to build an advertising network or a social media platform.
They eliminated feature creep at all costs.
Feature creep is when you add excessive features to a product that make it too difficult to use.
They focused on the reliability of WhatsApp over everything else.
2. Technology Stack
They implemented the core functionalities of WhatsApp servers with the Erlang programming language. The reasons behind this decision were the following :
provides high scalability with a tiny footprint
supports hot loading
Threads are a native feature of Erlang, unlike Java, or C++, where threads belong to the operating system. There is no need to save the entire CPU state in Erlang due to native threads. This makes context switching cheaper.
Hot loading makes it easier to deploy code changes without a server restart or traffic redirection. In simple words, Hot loading offers high availability.
3. Why Reinvent the Wheel?
Do not reinvent the wheel - either use open source or buy a commercial solution.
Ejabberd is an open-source real-time messaging server written in Erlang.
WhatsApp is built on top of ejabberd. They rewrote some of the ejabberd core components to fit their requirements.
WhatsApp leveraged third-party services such as Google Push to provide push notifications.
4. Cross-Cutting Concerns
Cross-cutting concerns are things that affect many parts of a product and are hard to separate. For example, monitoring and alerting the health of the services.
A huge emphasis on cross-cutting concerns improved product quality.
Continuous integration is the process of merging the code changes regularly into a central repository.
Continuous delivery is the process of code deployment to a testing or production environment. Continuous delivery is often automated.
The software development process at WhatsApp improved with Continuous integration and Continuous delivery.
5. Scalability
Horizontal scaling is the process of increasing the number of machines in the resource pool.
Vertical scaling is the process of increasing the capacity of an existing machine, such as the CPU or memory.
Diagonal scaling is a hybrid of horizontal and vertical scaling. The computing resources get added both vertically and horizontally.
WhatsApp used diagonal scaling to keep the costs and operational complexity low.
They ran WhatsApp servers on the FreeBSD operating system. Because they had previous experience with FreeBSD while working at Yahoo. Besides, FreeBSD offers a reliable network stack.
They fine-tuned FreeBSD to accommodate 2 million+ connections per server. And they modified kernel parameters such as files and sockets.
They overprovisioned servers to handle sudden traffic spikes and keep headroom for failures. For example, failures such as network partitions or hardware faults.
6. Flywheel Effect
They measured the metrics such as CPU, context switches, and system calls. Then identified and eliminated the bottlenecks. They did this at regular intervals.
The continuous feedback cycle tremendously improved the performance of WhatsApp.
7. Quality
Load testing is the process of measuring the performance of the system under the anticipated load.
The load testing identified single points of failure.
Artificial production traffic or DNS configuration to direct more traffic performed load testing.
8. Small Team Size
The communication paths between engineers increase quadratically as the team grows in size. This is a recipe for degraded productivity.
So, they kept the team size small - 32 engineers.
Summary
WhatsApp is one of the most successful instant messengers in the market.
In 2014, the same Facebook that rejected Jan Koum acquired WhatsApp for a whopping 19 billion USD.
According to Forbes, Jan Koum has a net worth of 14 billion USD in 2023.
Consider subscribing to get simplified case studies delivered straight to your inbox:
What to Read Next?
Read this post summary: Twitter Thread, LinkedIn Post, and Substack Notes.
Next Two Weeks: 🚀 New Posts Incoming!
Readership
A huge thank you to everybody who shared their support in the last week’s email. You are now 1,823 subscribers strong, very close to 2k. I think we can get to 2,000 subscribers by September 3rd. Y’all are the best.
Reader Praise
Word-of-mouth referrals like yours help this community grow - Thank you.
Get featured in the newsletter: Write your feedback on this post. And tag me on Twitter, LinkedIn, and Substack Notes. Or, you can reply to this email with anonymous feedback.
References
High Scalability. (2014, February 26). The WhatsApp Architecture Facebook Bought for $19 Billion.
Shopify. (n.d.). Feature Creep: What Causes It and How to Avoid It.
Stack Overflow. (2010, April 25). Technically, why are processes in Erlang more efficient than OS threads?
ejabberd. (n.d.). Home.
Wikipedia. (n.d.). Jan Koum.
Atlassian. (n.d.). Continuous Integration vs. Continuous Delivery vs. Continuous Deployment.
nOps. (2022). Horizontal vs. Vertical Scaling: Which One Is Best for Your Workloads?
JavaTpoint. (n.d.). Scaling in Cloud Computing.
Business Insider. (2015, October 6). The co-founder of WhatsApp explained why his $19 billion app was built using obscure software development tools.
BlazeMeter. (2022). Performance Testing vs. Load Testing vs. Stress Testing. Retrieved
Thumbnail Photo by Anton from Pexels
Rick Reed. WhatsApp: Half a billion unsuspecting FreeBSD users
Anton Lavrik. (2018) A Reflection on Building the WhatsApp Server - Code BEAM
8 Reasons Why WhatsApp Was Able to Support 50 Billion Messages a Day With Only 32 Engineers
Hi,
You write "The communication paths between engineers increase exponentially as the team grows in size.", but in fact the total number of communication paths between n nodes is equal to n * (n-1) / 2, which is not an exponential growth, but an polynomial growth (quadratic to be exact), i.e. O(n^2).