How Meta Achieves 99.99999999% Cache Consistency 🎯
#52: Break Into Meta Engineering (4 minutes)
Get the powerful template to approach system design for FREE on newsletter sign-up:
Why it matters: A fundamental way to scale a distributed system is to avoid coordination between components.
And a cache helps avoid the coordination needed to access the database. So cache data correctness is important for scalability.
Share this post & I'll send you some rewards as a thank you for the referrals.
Once upon a time, Facebook ran a simple tech stack - PHP & MySQL.
But as more users joined, they faced scalability problems.
So they set up a distributed cache1.
Although it temporarily solved their scalability issue, maintaining fresh cache data became difficult. Here’s a common race condition:
The client queries the cache for a value not present in it
So the cache queries the database for the value: x = 0
In the meantime, the value in the database gets changed: x = 1
But the cache invalidation event reaches the cache first: x = 1
Then the stale value from the cache fill reaches the cache: x = 0
Now database: x = 1, while cache: x = 0. So there’s cache inconsistency.
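To make the ordering concrete, here's a minimal sketch (not Meta's code) that replays the same race against an in-memory dict standing in for the cache:

```python
# Minimal sketch (not Meta's code): replaying the fill/invalidation race.
# A cache fill that started before a write can land after the write's
# invalidation and leave a stale value behind.

database = {"x": 0}   # source of truth
cache = {}            # distributed cache, modeled as a dict

# 1-2. Cache miss, so a cache fill reads the value from the database: x = 0
fill_value = database["x"]

# 3. Meanwhile, a write changes the value in the database: x = 1
database["x"] = 1

# 4. The invalidation event (carrying the new value) reaches the cache first
cache["x"] = 1

# 5. The slow cache fill finally lands, overwriting the newer value
cache["x"] = fill_value

print(database["x"], cache["x"])   # 1 0 -> database and cache now disagree
```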
Yet their growth rate was explosive, and Facebook became the third most visited site in the world.
Now they serve a quadrillion (10^15) requests per day. So even a 1% cache miss rate is expensive - 10 trillion cache fills2 a day.
This post outlines how Meta uses observability to improve cache consistency. It doesn't cover how they invalidate the cache, but how they detect when the cache has become inconsistent. You will find references at the bottom of this page if you want to go deeper.
Between the lines: This case study assumes a simple data model and the database & cache are aware of each other.
Cache Consistency
Cache inconsistency feels like data loss from a user’s perspective.
So they created an observability solution.
And here’s how they did it:
1. Monitoring 📈
They created a separate service to monitor cache inconsistency & called it Polaris.
Here’s how Polaris works:
It acts like a cache server & receives cache invalidation events
Then it queries cache servers to find data inconsistency
It queues inconsistent cache servers & checks again later
It checks data correctness during writes, so finding cache inconsistency is faster
Simply put, it measures cache inconsistency
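Here's a minimal sketch of that check, assuming a simple cache client whose entries carry a version number (the names & interfaces below are illustrative, not Meta's):

```python
from collections import deque

# Illustrative sketch (not Meta's code) of the Polaris core loop: receive an
# invalidation event, sample the cache replicas, and queue anything that still
# holds older data for a later re-check instead of alerting right away.

recheck_queue = deque()   # (event, replica) pairs to verify again later

def on_invalidation(event, cache_replicas):
    """event: dict with 'key' and 'version' from the invalidation stream."""
    for replica in cache_replicas:
        entry = replica.get(event["key"])   # assumed API: {'version': ..., 'value': ...} or None
        if entry is not None and entry["version"] < event["version"]:
            # Possibly stale; don't report yet, check again later.
            recheck_queue.append((event, replica))
```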
Besides, there's a risk of a network partition between the distributed cache & Polaris. So they use a separate invalidation event stream between the client & Polaris.
A simple fix for cache inconsistency is to query the database.
But there’s a risk of database overload at a high scale. So Polaris queries the database at timescales of 1, 5, or 10 minutes. It lets them back off efficiently & improve accuracy.
2. Tracing 🔎
Debugging a distributed cache without logs is hard.
And they wanted to find out why each cache inconsistency occurred. Yet logging every data change isn't scalable: it would be write-heavy, while the cache serves a read-heavy workload. So they created a tracing library & embedded it in each cache server.
Here’s how it works:
It logs only data changes that occur during the race condition time window. Thus log storage becomes cheaper
It keeps an index of recently modified data to determine if the next data change must be logged
Polaris reads logs if cache inconsistency is found & then sends notifications
And the absence of a log entry indicates a missing cache invalidation event.
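A minimal sketch of such a tracer, assuming a small in-memory index of recently written keys & a fixed race-condition window (the names are illustrative, not Meta's library):

```python
import time

# Illustrative sketch (not Meta's tracing library): log a cache mutation only
# when the same key was already written very recently, i.e. inside the window
# where fills and invalidations can interleave.

RACE_WINDOW_SECONDS = 5.0        # assumed window size, tune per workload

class RaceWindowTracer:
    def __init__(self, window=RACE_WINDOW_SECONDS):
        self.window = window
        self.recent_writes = {}  # key -> timestamp of its last mutation
        self.log = []            # stays small because most mutations are skipped

    def on_mutation(self, key, value, source):
        """Called by the cache server; source is e.g. 'fill' or 'invalidation'."""
        now = time.monotonic()
        last = self.recent_writes.get(key)
        if last is not None and now - last <= self.window:
            # A second mutation hit this key inside the window - exactly the
            # ordering Polaris needs when it debugs an inconsistency.
            self.log.append({"key": key, "value": value, "source": source, "ts": now})
        self.recent_writes[key] = now

tracer = RaceWindowTracer()
tracer.on_mutation("x", 0, "fill")           # first write: indexed, not logged
tracer.on_mutation("x", 1, "invalidation")   # inside the window: logged
print(tracer.log)
```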
The bottom line: Polaris finds cache inconsistency quickly, while tracing finds why it occurred.
Cache invalidation is one of the hard things in computer science. And this is an attempt to strengthen some distributed system properties3 without having the same coordination level as the database.
Now Meta supports 10 nines of cache consistency - 99.99999999%. Put simply, only 1 out of 10 billion cache writes becomes inconsistent.
And these techniques can be used with cache servers at any scale.
Subscribe to get simplified case studies delivered straight to your inbox:
Thank you for supporting this newsletter. Consider sharing this post with your friends and get rewards. Y’all are the best.
References
Cache Made Consistent – Cache Invalidation Might No Longer Be a Hard Thing
Cache made consistent - Meta's cache invalidation solution on Hacker News
1. A cache service that is shared across many servers
2. Adding data into the cache