The System Design Newsletter

How Meta Achieves 99.99999999% Cache Consistency 🎯

#52: Break Into Meta Engineering (4 minutes)

Neo Kim
Jul 18, 2024

Why it matters: A fundamental way to scale a distributed system is to avoid coordination between components.

And a cache helps to avoid the coordination needed to access the database. So cache data correctness is important for scalability.

Once upon a time, Facebook ran a simple tech stack - PHP & MySQL.

But as more users joined, they faced scalability problems.

So they set up a distributed cache[1].

Although it temporarily solved their scalability issue, maintaining fresh cache data became difficult. Here’s a common race condition:

  1. The client queries the cache for a value not present in it

  2. So the cache queries the database for the value: x = 0

  3. In the meantime, the value in the database gets changed: x = 1

  4. But the cache invalidation event reaches the cache first: x = 1

  5. Then the value from cache fill reaches the cache: x = 0

Race Condition During Cache Invalidation

Now database: x = 1, while cache: x = 0. So there’s cache inconsistency.
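The five steps above can be replayed in a few lines. This is only a sketch: the `database` and `cache` dictionaries and the `simulate_race` function are hypothetical stand-ins, not Meta's code, and the ordering of the two cache updates is forced by hand to mimic the race.

```python
# Hypothetical in-memory stand-ins for the database and the cache.
database = {"x": 0}
cache = {}


def read_from_database(key):
    return database[key]


def simulate_race():
    # Steps 1-2: cache miss, so the fill path reads x = 0 from the database.
    fill_value = read_from_database("x")
    # Step 3: a write changes the value in the database to x = 1.
    database["x"] = 1
    # Step 4: the invalidation event for that write reaches the cache first.
    cache["x"] = 1
    # Step 5: the delayed cache fill arrives and overwrites the newer value.
    cache["x"] = fill_value
    return database["x"], cache["x"]


db_value, cached_value = simulate_race()
print(db_value, cached_value)  # prints "1 0"
```

The cache keeps the stale value until something else invalidates or evicts it.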

Yet their growth rate was explosive. And they became the third most visited site in the world.

Now they serve a quadrillion (10^15) requests per day. So even a 1% cache miss rate is expensive - 10 trillion cache fills[2] a day.
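A quick check of that arithmetic, using only the two numbers from the paragraph above (the variable names are just for illustration):

```python
requests_per_day = 10**15  # a quadrillion requests per day
miss_rate_percent = 1      # 1% of requests miss the cache

# Every miss triggers one cache fill.
cache_fills_per_day = requests_per_day * miss_rate_percent // 100
print(f"{cache_fills_per_day:,}")  # prints "10,000,000,000,000"
```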

This post outlines how Meta uses observability to improve cache consistency. It doesn’t cover how they invalidate the cache, but how they find out when to invalidate it. You will find references at the bottom of this page if you want to go deeper.

Between the lines: This case study assumes a simple data model and that the database & cache are aware of each other.


Cache Consistency

Cache inconsistency feels like data loss from a user’s perspective.

So they created an observability solution.

And here’s how they did it:

1. Monitoring 📈

They created a separate service to monitor cache inconsistency & called it Polaris.

Polaris Monitoring Cache Inconsistency

Here’s how Polaris works:

  • It acts like a cache server & receives cache invalidation events

  • Then it queries cache servers to find data inconsistency

  • It queues inconsistent cache servers & checks again later

  • It checks data correctness during writes, so finding cache inconsistency is faster

  • Simply put, it measures cache inconsistency
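Here’s a rough sketch of that checking loop. Everything in it is an assumption made for illustration: the `CacheServer` and `MonitorSketch` classes, the method names, and the idea of attaching a version number to each value so the monitor can tell old data from new. It is not Polaris itself.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CacheServer:
    """Hypothetical cache replica storing key -> (value, version)."""
    name: str
    data: dict = field(default_factory=dict)

    def get(self, key):
        return self.data.get(key)


@dataclass
class MonitorSketch:
    """Polaris-like checker; a sketch under assumptions, not Meta's code."""
    servers: list
    recheck_queue: deque = field(default_factory=deque)

    def on_invalidation(self, key, version):
        # The monitor receives the event like a cache server would, then
        # checks whether every replica has caught up to this version.
        for server in self.servers:
            if not self.is_consistent(server, key, version):
                # Queue the replica and check again later; it may only lag.
                self.recheck_queue.append((server.name, key, version))

    @staticmethod
    def is_consistent(server, key, version):
        entry = server.get(key)
        # A missing entry is fine here: the next read will refill it.
        return entry is None or entry[1] >= version


fresh = CacheServer("cache-a", {"x": (1, 2)})
stale = CacheServer("cache-b", {"x": (0, 1)})
monitor = MonitorSketch([fresh, stale])
monitor.on_invalidation("x", version=2)
print(list(monitor.recheck_queue))  # prints "[('cache-b', 'x', 2)]"
```

Only the replica that still holds the older version ends up in the recheck queue.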

Besides, there’s a risk of a network partition between the distributed cache & Polaris. So they use a separate invalidation event stream between the client & Polaris.

A simple fix for cache inconsistency is to query the database.

But there’s a risk of database overload at a high scale. So Polaris queries the database at timescales of 1, 5, or 10 minutes. It lets them back off efficiently & improve accuracy.
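One way to picture the back-off: a sample flagged as inconsistent is only escalated once it has stayed inconsistent past a timescale. The sketch below assumes exactly that. The function names are hypothetical, and deferring the database read until the first timescale is crossed is an assumption about the mechanism, not a confirmed detail.

```python
TIMESCALES_SECONDS = (60, 300, 600)  # 1, 5, and 10 minutes, per the text


def timescales_crossed(first_seen, now):
    """Return the timescales an inconsistent sample has outlived."""
    age = now - first_seen
    return [t for t in TIMESCALES_SECONDS if age >= t]


def should_query_database(first_seen, now):
    # Assumption: the expensive database read is deferred until the
    # sample is still inconsistent after the smallest timescale.
    return bool(timescales_crossed(first_seen, now))


print(should_query_database(first_seen=0, now=30))  # prints "False"
print(timescales_crossed(first_seen=0, now=320))    # prints "[60, 300]"
```

Most short-lived inconsistencies heal before the first minute passes, so they never cost a database read.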

2. Tracing 🔎

Debugging a distributed cache without logs is hard.

And they wanted to find out why cache inconsistency occurs each time. Yet logging every data change isn’t scalable: logging is write-heavy, while the cache serves a read-heavy workload. So they created a tracing library & embedded it on each cache server.

Logging Data Changes Occurring During the Race Condition Window

Here’s how it works:

  • It logs only data changes that occur during the race condition time window. Thus log storage becomes cheaper

  • It keeps an index of recently modified data to determine if the next data change must be logged

  • Polaris reads logs if cache inconsistency is found & then sends notifications

The absence of logs, on the other hand, indicates a missing cache invalidation event.
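The selective logging can be sketched like this. The `TraceSketch` class, its method names, and the 5-second window are all assumptions made for illustration; the real library and its window length are not described here.

```python
class TraceSketch:
    """Hypothetical tracer; names and the window length are assumptions."""

    def __init__(self, window_seconds=5.0):
        self.window_seconds = window_seconds
        self.recently_modified = {}  # key -> time of the last change
        self.log = []

    def on_data_change(self, key, event, now):
        last_change = self.recently_modified.get(key)
        # Log only when this change lands inside the window opened by a
        # previous change to the same key, i.e. when a race is possible.
        if last_change is not None and now - last_change <= self.window_seconds:
            self.log.append((now, key, event))
        self.recently_modified[key] = now

    def events_for(self, key):
        # What a monitor would read after flagging `key` as inconsistent.
        return [entry for entry in self.log if entry[1] == key]


tracer = TraceSketch(window_seconds=5.0)
tracer.on_data_change("x", "invalidate", now=100.0)  # first change: index only
tracer.on_data_change("x", "fill", now=101.0)        # inside window: logged
tracer.on_data_change("y", "fill", now=200.0)        # other key: index only
print(tracer.events_for("x"))  # prints "[(101.0, 'x', 'fill')]"
```

Changes that happen far apart never reach the log, which is what keeps log storage cheap.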


The bottom line: Polaris finds cache inconsistency faster while tracing finds why it occurred.

Cache invalidation is one of the hard things in computer science. And this is an attempt to strengthen some distributed system properties[3] without having the same coordination level as the database.

Now Meta supports 10 nines of cache consistency - 99.99999999%. Put simply, only 1 out of 10 billion cache writes becomes inconsistent.
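The step from 10 nines to 1 in 10 billion can be verified with exact fractions (the variable names are just for illustration):

```python
from fractions import Fraction

consistency = Fraction(9_999_999_999, 10_000_000_000)  # 99.99999999%
inconsistent_fraction = 1 - consistency
print(inconsistent_fraction)  # prints "1/10000000000"
```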

And these techniques can be used with cache servers at any scale.



References

  • Cache made consistent

  • Cache Made Consistent – Cache Invalidation Might No Longer Be a Hard Thing

  • When and How to Invalidate Cache

  • Cache Invalidation

  • Scaling Memcache at Facebook

  • TAO: The power of the graph

  • Cache made consistent - Meta’s cache invalidation solution on HackerNews

  • Marc Brooker on Twitter

  • The Fundamental Mechanism of Scaling

  • Anna: A KVS For Any Scale

  • Rebuilding our tech stack for the new Facebook.com

  • Most Visited Websites In The World (July 2024)

[1] A cache service that is shared across many servers

[2] Adding data into the cache

[3] Bounded staleness vs Linearizability

