How Netflix Uses Chaos Engineering to Create Resilient Systems 🐒

#53: Break Into Netflix Engineering (4 minutes)

Aug 05, 2024

Chaos engineering is used to build resilient[1] distributed systems. You will find references at the bottom of this page if you want to go deeper.

Once upon a time, Netflix offered DVD rentals via mail.

Yet their growth rate was limited.

So they pivoted to create a streaming service.

And set up a monolithic tech stack.

Although explosive growth is a good problem to have, it made scaling their infrastructure difficult.

So they set up microservices.

But that created new problems.

Here are some of them:

1. Unreliable Network

Computer networks suffer from:

Latency issues
Network failures
Bandwidth limitations

So there’s a risk of communication failure between services.

2. Resilience

The resilience of a distributed system depends on its weakest component.

And the weakest component is usually found only after a failure, when users are already affected.

Chaos Engineering

They wanted to manage the chaos present in distributed systems.

So they created chaos engineering.

Here’s how:

1. Implementation

A system with 0 downtime doesn’t exist.

But downtime can be minimized via automation.

So they proactively look for potential failures during office hours & then automate the fix. That way, if the same failure occurs again, automation fixes it quickly. Observing a distributed system's behavior in a controlled experiment is called chaos engineering.

It helps to find a problem before it causes a production outage.

Chaos Monkey Turning off Traffic to the Main Database

They created a tool to randomly shut down servers and called it Chaos Monkey:

It gets the list of available servers from the continuous delivery platform
It shuts down servers through the continuous delivery platform

Then they check if traffic gets routed to another server without affecting users.
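
Here is a minimal sketch of that idea in Python. It is not Netflix's actual Chaos Monkey code. `DeliveryPlatform`, `Server`, `route_request`, and `unleash_monkey` are hypothetical names made up for illustration, and the "load balancer" is a toy that simply picks the first healthy server.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Server:
    name: str
    healthy: bool = True


@dataclass
class DeliveryPlatform:
    """Hypothetical stand-in for the continuous delivery platform."""
    servers: list[Server] = field(default_factory=list)

    def list_servers(self) -> list[Server]:
        # Only servers that are still running are candidates.
        return [s for s in self.servers if s.healthy]

    def terminate(self, server: Server) -> None:
        server.healthy = False


def route_request(platform: DeliveryPlatform) -> Optional[str]:
    """Toy load balancer: send the request to the first healthy server."""
    candidates = platform.list_servers()
    return candidates[0].name if candidates else None


def unleash_monkey(platform: DeliveryPlatform, rng: random.Random) -> Server:
    """Pick one running server at random and shut it down."""
    victim = rng.choice(platform.list_servers())
    platform.terminate(victim)
    return victim


platform = DeliveryPlatform([Server("app-1"), Server("app-2"), Server("app-3")])
victim = unleash_monkey(platform, random.Random(42))
served_by = route_request(platform)
print(f"terminated {victim.name}; traffic now served by {served_by}")
```

The check at the end is the important part: after the shutdown, a request should still land on a different, healthy server.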

A Systematic Approach to Chaos Engineering

Here’s how they do chaos engineering:

Create a hypothesis about how the system will behave during a failure
Run a small test to introduce a failure - switch off a server or change the network configuration
Observe the system’s behavior & measure the failure impact
Automate the fix for the problem

Then rerun the test to check if the automated fix works as expected.
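
The steps above can be sketched as a small experiment runner. This is only an illustration under simple assumptions: `run_experiment` and `ToyCluster` are hypothetical names, each toy server handles 100 requests per second, demand is fixed at 250 requests per second, and the "automated fix" is a flag that replaces a lost server right away.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    hypothesis: str
    baseline: float
    observed: float
    held: bool


def run_experiment(
    hypothesis: str,
    measure: Callable[[], float],
    inject_failure: Callable[[], None],
    restore: Callable[[], None],
    tolerance: float,
) -> ExperimentResult:
    baseline = measure()       # steady state before the failure
    inject_failure()           # small, deliberate failure
    try:
        observed = measure()   # behavior during the failure
    finally:
        restore()              # always put the system back
    held = abs(observed - baseline) <= tolerance
    return ExperimentResult(hypothesis, baseline, observed, held)


class ToyCluster:
    """Toy model: 100 req/s per server against 250 req/s of demand."""

    def __init__(self, servers: int, auto_replace: bool = False):
        self.servers = servers
        self.auto_replace = auto_replace

    def success_rate(self) -> float:
        return min(1.0, self.servers * 100 / 250)

    def kill_one(self) -> None:
        self.servers -= 1
        if self.auto_replace:
            self.servers += 1  # the automated fix: replace the lost server

    def restore(self, target: int) -> None:
        self.servers = target


cluster = ToyCluster(servers=3)
claim = "Success rate stays the same when one server is lost"
first = run_experiment(claim, cluster.success_rate, cluster.kill_one,
                       lambda: cluster.restore(3), tolerance=0.01)
cluster.auto_replace = True  # automate the fix, then rerun the same test
second = run_experiment(claim, cluster.success_rate, cluster.kill_one,
                        lambda: cluster.restore(3), tolerance=0.01)
print(first.held, second.held)
```

In this toy setup the hypothesis fails on the first run and holds on the rerun, which mirrors the loop of test, fix, and test again.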

They run chaos experiments on production traffic for accurate results.

Yet the blast radius[2] must be controlled to avoid affecting users. Here’s how:

Prepare for the worst case with a backup plan
Use feature flags to roll back changes quickly if things go wrong
Run tests in pre-production before production
Run only small tests first & then scale
Introduce one chaos variable at a time - don’t break everything together
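
One way to picture "small tests first & then scale" is a percentage-based flag with a stop condition. The sketch below is an assumption about how this could look, not a description of Netflix's tooling. `in_experiment` and `ramp` are hypothetical helpers, and the error rates come from a made-up toy function.

```python
from __future__ import annotations

import hashlib
from typing import Callable


def in_experiment(user_id: str, percent: float) -> bool:
    """Deterministically place roughly `percent` % of users in the experiment."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def ramp(
    stages: list[float],
    error_rate_at: Callable[[float], float],
    max_error_rate: float,
) -> tuple[float, bool]:
    """Grow the blast radius stage by stage; stop at the first unsafe stage.

    Returns the last safe percentage and whether every stage completed.
    """
    last_safe = 0.0
    for percent in stages:
        if error_rate_at(percent) > max_error_rate:
            return last_safe, False  # stop here and switch the flag off
        last_safe = percent
    return last_safe, True


def toy_error_rate(percent: float) -> float:
    # Made-up model: errors jump once more than 5% of users are affected.
    return 0.001 if percent <= 5 else 0.05


last_safe, completed = ramp([1, 5, 25], toy_error_rate, max_error_rate=0.01)
print(last_safe, completed)
```

Hashing the user ID keeps each user in the same bucket across requests, so a user who is in the 1% stage is also in the 5% stage and the experiment group only grows.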

Besides, they measure the impact of each test carefully to prevent unnecessary damage.

2. Principles

They created chaos engineering around these principles:

Automate tests to save cost & time
Run tests in production to get real traffic patterns & reliable results
Prioritize the events to test by potential impact & frequency - server crash, wrong API response, traffic spike
Focus on measurable output to check if the system works - throughput, latency
Control & minimize blast radius
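
To make "measurable output" concrete, here is a small sketch that compares throughput and p99 latency between a control group and an experiment group. The `Request` type, the helper names, and the thresholds (a 5% throughput drop, 200 ms p99) are assumptions picked for illustration, and the request data is synthetic.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def throughput(requests: list[Request], window_s: float) -> float:
    """Successful requests per second over the observation window."""
    return sum(r.ok for r in requests) / window_s


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile, e.g. pct=99 for p99 latency."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def steady_state_ok(
    control: list[Request],
    experiment: list[Request],
    window_s: float,
    max_drop: float = 0.05,
    max_p99_ms: float = 200.0,
) -> bool:
    """True if the experiment group stays close to the control group."""
    base = throughput(control, window_s)
    seen = throughput(experiment, window_s)
    p99 = percentile([r.latency_ms for r in experiment], 99)
    return seen >= base * (1 - max_drop) and p99 <= max_p99_ms


# Synthetic data: 100 requests in a 10 s window, latencies 1..100 ms.
control = [Request(float(ms), True) for ms in range(1, 101)]
# In the experiment group every 10th request fails.
experiment = [Request(float(ms), ms % 10 != 0) for ms in range(1, 101)]

print(steady_state_ok(control, control, window_s=10))
print(steady_state_ok(control, experiment, window_s=10))
```

Comparing against a control group that sees the same traffic at the same time is one way to avoid blaming the experiment for a normal traffic dip.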

It gave them more confidence to experiment & get better results.

3. Use Cases

Here’s what they do with chaos engineering:

Reduce the number of failures
Improve system availability - for example, toward a 99.9% target
Find potential failures & automate the fix
Check if failover mechanisms work as expected
Find system bottlenecks & single points of failure
Check if data backup & restoration work as expected
Check how the system responds to service dependency failures
Do better capacity planning by studying the system’s response to various traffic
Check if the system recovers from failures quickly - mean time to resolution (MTTR)
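
A bit of arithmetic shows why availability and MTTR go together. The sketch below uses the common steady-state model, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures. That model and the sample outage durations are assumptions for illustration, not figures from Netflix.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 365) -> float:
    """Downtime budget for a given availability target over a period."""
    return (1 - availability_pct / 100) * period_days * 24 * 60


def mttr_minutes(outage_minutes: list) -> float:
    """Mean time to resolution: average duration of the recorded outages."""
    return sum(outage_minutes) / len(outage_minutes)


def availability_pct(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from MTBF and MTTR (simple model)."""
    return 100 * mtbf_hours / (mtbf_hours + mttr_hours)


budget = allowed_downtime_minutes(99.9)           # minutes per year
mttr = mttr_minutes([30, 60, 90])                 # made-up outage durations
print(f"99.9% allows about {budget / 60:.2f} hours of downtime per year")
print(f"MTTR for the sample outages: {mttr:.0f} minutes")
print(f"MTBF 999 h with MTTR 1 h gives {availability_pct(999, 1):.1f}% availability")
```

With the failure rate held constant, cutting MTTR is the direct lever on availability, which is why automated fixes matter so much here.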