The System Design Newsletter

How Facebook Was Able to Support a Billion Users via Software Load Balancer ⚡


#58: Break into Meta Engineering (5 minutes)

Neo Kim
Oct 11, 2024



This post outlines how Facebook scaled its load balancer to a billion users. You will find references at the bottom of this page if you want to go deeper.


Note: This post is based on my research and may differ from real-world implementation.

Once upon a time, Facebook ran on a single server.

Life was good.

But as more users joined, they faced scalability issues.

Although adding more servers temporarily solved their scalability problems, routing traffic became difficult.

So they were sad & frustrated.

Facebook Load Balancer
Image Created With imgflip

Until one morning when they had a smart idea to create a software load balancer.

Yet they wanted to keep it reliable and offer extreme scalability.

So they used simple ideas to build it.

Onward.



Facebook Load Balancer

Here’s how they scaled the load balancer from 0 to 1 billion users:

1. One Million Users

Clients send encrypted data using Transport Layer Security (TLS). Think of TLS as a security protocol that protects data in transit.

And the server must decrypt the data before processing it. This is called TLS termination.

Yet decryption needs a ton of computing power.

Layer 7 Load Balancer Proxying Requests to the Server
  • So they do TLS termination on the load balancer and forward decrypted data to servers, thus reducing server load.

They use a layer 7 load balancer to proxy requests. It routes traffic based on application data - URLs, HTTP headers, and so on.
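
Here's a minimal sketch of what routing on application data could look like. It's only an illustration: the `Request` type, the `ROUTES` table, the pool names, and the `X-Canary` header are all hypothetical and not taken from Facebook's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    path: str
    headers: dict = field(default_factory=dict)


# Hypothetical routing rules: (path prefix, server pool name).
# More specific prefixes come first because the first match wins.
ROUTES = [
    ("/api/", "api-servers"),
    ("/static/", "static-servers"),
    ("/", "web-servers"),
]


def choose_pool(request: Request) -> str:
    """Pick a server pool from application data: an HTTP header, then the URL path."""
    # Assumed rule: a header can override path-based routing.
    if request.headers.get("X-Canary") == "1":
        return "canary-servers"
    for prefix, pool in ROUTES:
        if request.path.startswith(prefix):
            return pool
    return "web-servers"
```

A layer 4 load balancer couldn't make this decision because it never looks at the path or the headers.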

2. Ten Million Users

Yet a single layer 7 load balancer might crash with high traffic.

Layer 4 Load Balancer Routing Requests to Layer 7 Load Balancers

So they added more layer 7 load balancers and put a layer 4 load balancer in front.

A layer 4 load balancer does Transmission Control Protocol (TCP) routing. Put simply, it doesn’t inspect application-level data. Instead, it forwards traffic based on IP address & port number.

Thus offering better performance.

Also, they use consistent hashing for routing requests to layer 7 load balancers.

  • The IP addresses and port numbers get hashed to find the target load balancer.

If the original layer 7 load balancer fails, its requests get routed to the next layer 7 load balancer. And future requests get routed to the original load balancer once it’s live again.
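
The hashing and failover behavior can be sketched with a small hash ring. This is a simplified illustration under my own assumptions: the `HashRing` class, the use of SHA-256, the 100 virtual nodes per load balancer, and hashing both source and destination address are choices I made for the example, not details from Facebook.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable across runs, unlike the built-in hash() for strings.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent hash ring mapping a connection to a layer 7 load balancer."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes  # assumed value; spreads each node around the ring
        self._ring = []        # sorted list of (point, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def lookup(self, src_ip, src_port, dst_ip, dst_port):
        if not self._ring:
            raise LookupError("no load balancers available")
        point = _hash(f"{src_ip}:{src_port}-{dst_ip}:{dst_port}")
        # First ring entry clockwise from the point; wrap around at the end.
        idx = bisect.bisect_left(self._ring, (point, ""))
        return self._ring[idx % len(self._ring)][1]
```

Calling `remove()` on a failed load balancer moves only its connections to the next one on the ring. And calling `add()` with the same name puts it back at the same ring positions, so those connections return to it.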

Service Discovery to Find Layer 7 Load Balancers
  • They use Apache Zookeeper for service discovery. It lets the layer 4 load balancer find available layer 7 load balancers.

Besides, a TCP connection must be created before requests can be sent.

Yet it’s not efficient to create one every time due to connection overhead.

  • So they maintain open TCP connections between layer 4 and layer 7 load balancers.

And keep a separate state table with the list of open connections. It lets them do failover quickly if the layer 4 load balancer fails.
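
Here's a rough sketch of why a separate state table helps with failover. It models only the table, not real sockets. The `ConnectionTable` and `Layer4Balancer` classes and the CRC32-based pick for new connections are my assumptions for illustration.

```python
import zlib


class ConnectionTable:
    """Open connections and the layer 7 load balancer each one is pinned to."""

    def __init__(self):
        self._flows = {}

    def lookup(self, flow):
        return self._flows.get(flow)

    def record(self, flow, backend):
        self._flows[flow] = backend

    def close(self, flow):
        self._flows.pop(flow, None)


class Layer4Balancer:
    def __init__(self, backends, table):
        self.backends = list(backends)
        self.table = table  # kept separately, so it outlives this instance

    def route(self, flow):
        backend = self.table.lookup(flow)
        if backend is None:
            # New connection: pick a backend and remember the choice.
            index = zlib.crc32(repr(flow).encode()) % len(self.backends)
            backend = self.backends[index]
            self.table.record(flow, backend)
        return backend
```

Because the table lives outside any single instance, a standby layer 4 load balancer that shares it can keep sending an existing connection to the same layer 7 load balancer.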

3. Hundred Million Users

But a single layer 4 load balancer might crash with explosive traffic.

Router Proxying Requests to Layer 4 Load Balancers

So they added more layer 4 load balancers and put a router in front.

Here’s how the router works:

  • It uses Border Gateway Protocol (BGP) to find the best routes to the layer 4 load balancers

  • It uses the Equal Cost Multi Path (ECMP) routing technique to load balance traffic across those routes

Thus avoiding network congestion on a single route and making better use of bandwidth.
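
A common way to do ECMP is to hash each connection's addresses and ports, then use the hash to pick one of the equal-cost routes. The sketch below assumes that approach. The `ecmp_next_hop` function, the SHA-256 hash, and the route names are hypothetical. Real routers do this in hardware with their own hash functions.

```python
import hashlib


def ecmp_next_hop(flow, next_hops):
    """Pick one of several equal-cost next hops by hashing the flow."""
    if not next_hops:
        raise ValueError("no routes available")
    key = "|".join(str(part) for part in flow).encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    # Same flow, same hash, same route: packets of one connection stay together.
    return next_hops[digest % len(next_hops)]
```

Packets from the same connection always take the same route, while different connections spread across all routes.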

4. One Billion Users

Yet the router is now a single point of failure.

Replicating Clusters Within a Data Center for High Availability

So they added more clusters within a data center - each cluster consists of a router and load balancers.

And set up extra data centers for more computing capacity.

Creation of DNS Map

Besides, they do smart DNS load balancing across data centers.

  • A network map is created in real time based on server capacity, server health, and so on.

The requests then get routed to the optimal server based on this network map.
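
Here's a simplified sketch of picking a data center from such a map. The post only names capacity and health as inputs, so the latency field, the `DataCenter` type, and the selection rule are my assumptions.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str
    healthy: bool
    capacity: int      # requests per second the site can serve (hypothetical unit)
    load: int          # current requests per second
    latency_ms: float  # assumed input: latency measured from the user's resolver


def pick_data_center(network_map):
    """Return the healthy data center with spare capacity and the lowest latency."""
    candidates = [dc for dc in network_map if dc.healthy and dc.load < dc.capacity]
    if not candidates:
        raise LookupError("no data center can take traffic")
    return min(candidates, key=lambda dc: dc.latency_ms)
```

The DNS server would then answer with an IP address in the chosen data center. And the answer changes as the map gets updated.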


They use software load balancers for flexibility.

And use commodity hardware everywhere except for the router.

And Facebook became one of the most visited sites with ~3 billion users.



References

  • Building A Billion-User Load Balancer

  • Connecting the World: A Look into Facebook’s Networking Infrastructure

  • Open-sourcing Katran, a scalable network load balancer

  • Moving fast with high-performance Hack

  • What is TLS (Transport Layer Security)?

  • Linux Virtual Server

  • What is SSL Termination?

  • What is Direct Server Return (DSR)?

  • What is the OSI Model?

  • Facebook Data Center Locations
