How Meta Achieves 99.99999999% Cache Consistency 🎯
#52: Break Into Meta Engineering (4 minutes)
Why it matters: A fundamental way to scale a distributed system is to avoid coordination between components.
A cache avoids the coordination needed to access the database. So the correctness of cached data is important for scalability.
Once upon a time, Facebook ran a simple tech stack - PHP & MySQL.
But as more users joined, they faced scalability problems.
So they set up a distributed cache [1].
Although it temporarily solved their scalability issue, maintaining fresh cache data became difficult. Here’s a common race condition:
The client queries the cache for a value not present in it
So the cache queries the database for the value: x = 0
In the meantime, the value in the database gets changed: x = 1
But the cache invalidation event reaches the cache first: x = 1
Then the value from cache fill reaches the cache: x = 0
Now database: x = 1, while cache: x = 0. So there’s cache inconsistency.
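The steps above can be replayed in a few lines. This is a minimal sketch, assuming plain dictionaries stand in for the database and the cache, and that the invalidation event carries the new value. The function name is hypothetical and not from Meta's code.

```python
def simulate_race():
    """Replay the race: a stale cache fill lands after a newer invalidation."""
    database = {"x": 0}   # hypothetical stand-in for the database
    cache = {}            # hypothetical stand-in for one cache server

    # Steps 1-2: cache miss, so the cache reads x from the database.
    fill_value = database["x"]          # reads 0, still in flight

    # Step 3: the value in the database changes meanwhile.
    database["x"] = 1

    # Step 4: the invalidation event (assumed to carry x = 1) arrives first.
    cache["x"] = database["x"]

    # Step 5: the delayed cache fill arrives and overwrites the newer value.
    cache["x"] = fill_value

    return database, cache


database, cache = simulate_race()
print(database["x"], cache["x"])  # prints: 1 0
```

Nothing fails here. Each step is valid on its own, and only the arrival order makes the cache stale.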
Yet their growth rate was explosive. And Facebook became the third most visited site in the world.
Now they serve a quadrillion (10^15) requests per day. So even a 1% cache miss rate is expensive - 10 trillion cache fills [2] a day.
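A quick check of that arithmetic, using only the two figures quoted above. The variable names are illustrative.

```python
requests_per_day = 10**15      # a quadrillion requests, as quoted above
miss_rate_percent = 1          # a 1% cache miss rate

# Every miss triggers one cache fill.
cache_fills_per_day = requests_per_day * miss_rate_percent // 100

print(f"{cache_fills_per_day:,}")  # prints: 10,000,000,000,000
```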
This post outlines how Meta uses observability to improve cache consistency. It covers how they detect when the cache needs to be invalidated, not how they invalidate it. You will find references at the bottom of this page if you want to go deeper.
Between the lines: This case study assumes a simple data model and that the database & cache are aware of each other.
Cache Consistency
Cache inconsistency feels like data loss from a user’s perspective.
So they created an observability solution.
And here’s how they did it:
1. Monitoring 📈
They created a separate service to monitor cache inconsistency & called it Polaris.
Here’s how Polaris works:
It acts like a cache server & receives cache invalidation events
Then it queries cache servers to find data inconsistency
It queues inconsistent cache servers & checks again later
It checks data correctness during writes, so finding cache inconsistency is faster
Simply put, it measures cache inconsistency
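The loop above can be sketched as a toy monitor. This is not Polaris itself. The class and field names are hypothetical, and the sketch assumes every write carries an increasing version number so a stale entry can be spotted by comparing versions.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Invalidation:
    key: str
    version: int  # assumption: writes carry a monotonically increasing version


class Monitor:
    """Toy Polaris-like checker; all names here are hypothetical."""

    def __init__(self, replicas):
        self.replicas = replicas   # replica name -> {key: cached version}
        self.recheck = deque()     # (replica name, event) pairs to check later

    def _is_stale(self, name, event):
        # A missing key counts as consistent: the invalidation removed it.
        cached = self.replicas[name].get(event.key)
        return cached is not None and cached < event.version

    def on_invalidation(self, event):
        # Receive the event like a cache server, then query every replica.
        for name in self.replicas:
            if self._is_stale(name, event):
                self.recheck.append((name, event))

    def recheck_pending(self):
        # Check queued replicas again; keep only those that are still stale.
        self.recheck = deque(
            (name, event) for name, event in self.recheck
            if self._is_stale(name, event)
        )
        return [(name, event.key) for name, event in self.recheck]


replicas = {"cache-a": {"x": 2}, "cache-b": {"x": 1}}
monitor = Monitor(replicas)
monitor.on_invalidation(Invalidation(key="x", version=2))
print(monitor.recheck_pending())  # prints: [('cache-b', 'x')]
```

Queueing instead of alerting right away gives a slow invalidation time to arrive before the replica is counted as inconsistent.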
Besides, there’s a risk of a network partition between the distributed cache & Polaris. So they use a separate invalidation event stream between the client & Polaris.
A simple fix for cache inconsistency is to query the database.
But there’s a risk of database overload at high scale. So Polaris queries the database only at timescales of 1, 5, or 10 minutes. This lets it back off efficiently & improve accuracy.
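One way to picture the back-off: a suspected inconsistency is compared against the database only once it has survived one of the timescales. This is a sketch under that assumption. The sample layout and function names are hypothetical.

```python
TIMESCALES_SECONDS = (60, 300, 600)  # 1, 5, and 10 minutes, as in the text


def confirm_with_database(samples, now, read_cache, read_database):
    """Query the database only for samples that have outlived a timescale.

    Each sample is a dict: {"key": str, "first_seen": float, "checked": set}.
    Returns (key, timescale) pairs that are still inconsistent.
    """
    reports = []
    for sample in samples:
        age = now - sample["first_seen"]
        for window in TIMESCALES_SECONDS:
            if age < window or window in sample["checked"]:
                continue  # too young for this window, or already checked
            sample["checked"].add(window)
            if read_cache(sample["key"]) != read_database(sample["key"]):
                reports.append((sample["key"], window))
    return reports
```

A sample that is 30 seconds old costs no database query. At 90 seconds it costs exactly one, for the 1-minute timescale.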
2. Tracing 🔎
Debugging a distributed cache without logs is hard.
And they wanted to find out why cache inconsistency occurs each time. Yet logging every data change isn’t scalable: logging is write-heavy, while the cache serves a read-heavy workload. So they created a tracing library & embedded it on each cache server.
Here’s how it works:
It logs only data changes that occur during the race condition time window. Thus log storage becomes cheaper
It keeps an index of recently modified data to determine if the next data change must be logged
Polaris reads logs if cache inconsistency is found & then sends notifications
The absence of logs indicates a missing cache invalidation event
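The selective logging above can be sketched like this. It is a simplified illustration, not Meta's tracing library. The class name, the 1-second window, and the log tuple layout are assumptions.

```python
import time


class RaceWindowTracer:
    """Log a change only if the same key changed moments earlier."""

    def __init__(self, window_seconds=1.0, clock=time.monotonic):
        self.window = window_seconds  # assumed race-condition time window
        self.clock = clock
        self.last_modified = {}       # index: key -> time of latest change
        self.log = []                 # (time, key, source, value) tuples

    def record_change(self, key, source, value):
        now = self.clock()
        previous = self.last_modified.get(key)
        # Two changes close together are candidates for a race, so log
        # the later one. Isolated changes are skipped to keep logs small.
        if previous is not None and now - previous <= self.window:
            self.log.append((now, key, source, value))
        self.last_modified[key] = now

    def entries_for(self, key):
        return [entry for entry in self.log if entry[1] == key]
```

A real index would also evict old keys to bound memory. That part is left out to keep the sketch short.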
The bottom line: Polaris finds cache inconsistency faster while tracing finds why it occurred.
Cache invalidation is one of the hard things in computer science. And this is an attempt to strengthen some distributed system properties [3] without having the same coordination level as the database.
Now Meta supports 10 nines of cache consistency - 99.99999999%. Put simply, only 1 out of 10 billion cache writes becomes inconsistent.
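To see how 10 nines turns into 1 out of 10 billion, here is the conversion worked out. It uses Python's decimal module to avoid floating-point rounding. The variable names are illustrative.

```python
from decimal import Decimal

consistency_percent = Decimal("99.99999999")   # 10 nines, as quoted above

# Fraction of cache writes that end up inconsistent.
inconsistent_fraction = 1 - consistency_percent / 100

# Number of cache writes per single inconsistent write.
writes_per_inconsistency = 1 / inconsistent_fraction

print(f"{int(writes_per_inconsistency):,}")  # prints: 10,000,000,000
```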
And these techniques can be used with cache servers at any scale.
References
Cache Made Consistent – Cache Invalidation Might No Longer Be a Hard Thing
Cache made consistent - Meta’s cache invalidation solution on HackerNews
[1] A cache service that is shared across many servers
[2] Adding data into the cache