The System Design Newsletter


How Meta Achieves 99.99999999% Cache Consistency 🎯

#52: Break Into Meta Engineering (4 minutes)

Neo Kim
Jul 18, 2024
A fundamental way to scale a distributed system is to avoid coordination between components.

And a cache avoids the coordination needed to access the database. So the correctness of cached data is important for scalability.


Once upon a time, Facebook ran a simple tech stack - PHP & MySQL.

But as more users joined, they faced scalability problems.

So they set up a distributed cache[1].

Although it temporarily solved their scalability issue, maintaining fresh cache data became difficult. Here’s a common race condition:

  1. The client queries the cache for a value not present in it

  2. So the cache queries the database for the value: x = 0

  3. In the meantime, the value in the database gets changed: x = 1

  4. But the cache invalidation event reaches the cache first: x = 1

  5. Then the value from cache fill reaches the cache: x = 0

Race Condition During Cache Invalidation

Now database: x = 1, while cache: x = 0. So there’s cache inconsistency.
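The five steps above can be replayed in a few lines. This is a minimal sketch, not Meta's code: the dictionaries standing in for the database and the cache, and the function name `replay_race`, are assumptions made for illustration.

```python
def replay_race():
    """Replay the five steps in order and return (cache value, database value)."""
    database = {"x": 0}  # hypothetical stand-in for the database
    cache = {}           # hypothetical stand-in for the cache server

    # 1. The client queries the cache; the value is not present.
    assert "x" not in cache

    # 2. The cache reads x = 0 from the database. The fill is now in flight.
    in_flight_fill = database["x"]

    # 3. In the meantime, the value in the database changes.
    database["x"] = 1

    # 4. The invalidation event carrying x = 1 reaches the cache first.
    cache["x"] = 1

    # 5. The delayed cache fill arrives and overwrites the newer value.
    cache["x"] = in_flight_fill

    return cache["x"], database["x"]


cache_value, database_value = replay_race()
print(cache_value, database_value)  # 0 1
```

Nothing fails loudly here. The cache simply keeps serving x = 0 until something else touches that key.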

Yet their growth rate was explosive. And Facebook became the third most visited site in the world.


Now they serve a quadrillion (10^15) requests per day. So even a 1% cache miss rate is expensive - 10 trillion cache fills[2] a day.
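The arithmetic is easy to check. A quick sketch using only the two numbers from the paragraph above (the variable names are just for illustration):

```python
requests_per_day = 10**15   # a quadrillion requests per day
miss_rate_percent = 1       # a 1% cache miss rate

# Every miss triggers one cache fill.
cache_fills_per_day = requests_per_day * miss_rate_percent // 100

print(f"{cache_fills_per_day:,}")  # 10,000,000,000,000
```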

This post outlines how Meta uses observability to improve cache consistency. It doesn’t cover how they invalidate the cache, only how they find out when to invalidate it. You will find references at the bottom of this page if you want to go deeper.

This case study assumes a simple data model and the database & cache are aware of each other.


Cache Consistency

Cache inconsistency feels like data loss from a user’s perspective.

So they created an observability solution.

And here’s how they did it:

1. Monitoring 📈

They created a separate service to monitor cache inconsistency & called it Polaris.

Polaris Monitoring Cache Inconsistency

Here’s how Polaris works:

  • It acts like a cache server & receives cache invalidation events

  • Then it queries cache servers to find data inconsistency

  • It queues inconsistent cache servers & checks again later

  • It checks data correctness during writes, so finding cache inconsistency is faster

  • Simply put, it measures cache inconsistency
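The bullets above can be sketched as a small checker. This is a toy model under stated assumptions, not Polaris itself: the class names `CacheServer` and `PolarisSketch`, the in-memory stores, and the idea that an invalidation event carries the expected value are all hypothetical.

```python
from collections import deque


class CacheServer:
    """Hypothetical cache server: a name plus an in-memory key/value store."""

    def __init__(self, name):
        self.name = name
        self.store = {}


class PolarisSketch:
    """Toy monitor: receives invalidation events and verifies cache servers."""

    def __init__(self, cache_servers):
        self.cache_servers = cache_servers
        self.recheck_queue = deque()       # samples to check again later
        self.checked = 0                   # how many samples were examined
        self.confirmed_inconsistent = []   # samples still wrong on recheck

    def on_invalidation(self, key, expected):
        """Handle an invalidation event the way a cache server would receive it."""
        for server in self.cache_servers:
            self.checked += 1
            if not self._is_consistent(server, key, expected):
                self.recheck_queue.append((server, key, expected))

    def recheck(self):
        """Check queued samples again and record the ones that are still stale."""
        for _ in range(len(self.recheck_queue)):
            server, key, expected = self.recheck_queue.popleft()
            if not self._is_consistent(server, key, expected):
                self.confirmed_inconsistent.append((server.name, key))

    @staticmethod
    def _is_consistent(server, key, expected):
        # A missing entry is fine: the next read fills it from the database.
        return key not in server.store or server.store[key] == expected


a, b = CacheServer("a"), CacheServer("b")
a.store["x"] = 1   # already holds the new value
b.store["x"] = 0   # stale
polaris = PolarisSketch([a, b])
polaris.on_invalidation("x", expected=1)
polaris.recheck()
print(polaris.confirmed_inconsistent)  # [('b', 'x')]
```

Dividing the number of confirmed samples by `checked` gives a rough inconsistency ratio, which is the measurement the last bullet refers to.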

Besides, there’s a risk of a network partition between the distributed cache & Polaris. So they use a separate invalidation event stream between the client & Polaris.

A simple fix for cache inconsistency is to query the database.

But there’s a risk of database overload at a high scale. So Polaris queries the database at timescales of 1, 5, or 10 minutes. It lets them back off efficiently & improve accuracy.
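Here is one way to picture the timescales. It is a sketch under an assumption: a suspect key is compared against the database only when a timescale elapses, so the database sees at most three extra reads per suspect key. The function name `violated_timescales` and its parameter are hypothetical.

```python
TIMESCALES_SECONDS = (60, 300, 600)  # 1, 5, and 10 minutes


def violated_timescales(converged_after_seconds, timescales=TIMESCALES_SECONDS):
    """Return the timescales at which a database comparison would still fail.

    converged_after_seconds: how long after the write the cache caught up
    with the database, or None if it never did.
    """
    violated = []
    for timescale in timescales:
        # The database is queried only at these points, never continuously.
        still_stale = (
            converged_after_seconds is None
            or converged_after_seconds > timescale
        )
        if still_stale:
            violated.append(timescale)
    return violated


print(violated_timescales(10))    # []
print(violated_timescales(90))    # [60]
print(violated_timescales(None))  # [60, 300, 600]
```

A cache entry that heals within 10 seconds never costs a database read that reports a problem. One that takes 90 seconds counts as inconsistent at the 1-minute timescale only.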

2. Tracing 🔎

Debugging a distributed cache without logs is hard.

And they wanted to find out why cache inconsistency occurs each time. Yet logging every data change isn’t scalable: logging is write-heavy, while the cache serves a read-heavy workload. So they created a tracing library & embedded it on each cache server.

Logging Data Changes Occurring During the Race Condition Window

Here’s how it works:

  • It logs only data changes that occur during the race condition time window. Thus log storage becomes cheaper

  • It keeps an index of recently modified data to determine if the next data change must be logged

  • Polaris reads logs if cache inconsistency is found & then sends notifications

And an absence of logs indicates a missing cache invalidation event.
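The tracing idea above can be sketched as follows. This is a minimal illustration, not Meta's library: the class name `RaceWindowTracer`, the 5-second window, and the rule "log a change only if the same key changed within the window" are assumptions.

```python
class RaceWindowTracer:
    """Toy tracer: logs a change only if the key was modified recently."""

    def __init__(self, window_seconds=5.0):  # the window length is an assumed value
        self.window_seconds = window_seconds
        self.recently_modified = {}  # index: key -> time of its last change
        self.log = []                # (time, key, value) entries

    def on_change(self, key, value, now):
        last_change = self.recently_modified.get(key)
        # Two changes close together may race, so only those get logged.
        if last_change is not None and now - last_change <= self.window_seconds:
            self.log.append((now, key, value))
        self.recently_modified[key] = now

    def traces_for(self, key):
        """What a monitor would read after finding an inconsistency on key."""
        return [entry for entry in self.log if entry[1] == key]


tracer = RaceWindowTracer()
tracer.on_change("x", 0, now=0.0)    # first change: indexed, not logged
tracer.on_change("x", 1, now=1.0)    # within the window: logged
tracer.on_change("y", 7, now=2.0)    # different key: not logged
tracer.on_change("x", 2, now=100.0)  # long after the window: not logged
print(tracer.traces_for("x"))  # [(1.0, 'x', 1)]
```

Four changes produce one log entry. An empty result from `traces_for` on an inconsistent key is the "absence of logs" case.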


Polaris finds cache inconsistency faster while tracing finds why it occurred.

Cache invalidation is one of the hard things in computer science. And this is an attempt to strengthen some distributed system properties[3] without having the same coordination level as the database.

Now Meta supports 10 nines of cache consistency - 99.99999999%. Put simply, only 1 out of 10 billion cache writes becomes inconsistent.
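To see where "1 out of 10 billion" comes from, here is the same number as an exact fraction. It is a small sketch using Python's standard `fractions` module to avoid floating-point rounding.

```python
from fractions import Fraction

consistency = Fraction(9_999_999_999, 10**10)  # 99.99999999%, ten nines
inconsistent_share = 1 - consistency

print(inconsistent_share)  # 1/10000000000
```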

And these techniques can be used with cache servers at any scale.



References

  • Cache made consistent

  • Cache Made Consistent – Cache Invalidation Might No Longer Be a Hard Thing

  • When and How to Invalidate Cache

  • Cache Invalidation

  • Scaling Memcache at Facebook

  • TAO: The power of the graph

  • Cache made consistent - Meta’s cache invalidation solution on HackerNews

  • Marc Brooker on Twitter

  • The Fundamental Mechanism of Scaling

  • Anna: A KVS For Any Scale

  • Rebuilding our tech stack for the new Facebook.com

  • Most Visited Websites In The World (July 2024)

[1] A cache service that is shared across many servers

[2] Adding data into the cache

[3] Bounded staleness vs Linearizability

