The System Design Newsletter

How Netflix Uses Chaos Engineering to Create Resilience Systems 🐒

#53: Break Into Netflix Engineering (4 minutes)

Neo Kim
Aug 05, 2024


Chaos engineering is used to build resilient[1] distributed systems. You will find references at the bottom of this page if you want to go deeper.


Once upon a time, Netflix offered DVD rentals via mail.

Yet their growth rate was limited.

So they pivoted to create a streaming service.

And set up a monolith tech stack.

Although explosive growth is a good problem, scaling their infrastructure became difficult.


So they set up microservices.

But it created newer problems.

Here are some of them:

1. Unreliable Network

Computer networks suffer from:

  • Latency issues

  • Network failures

  • Bandwidth limitations

So there’s a risk of communication failure between services.

2. Resilience

The resilience of a distributed system depends on its weakest component.

And the weakest component is usually found only after a failure, when users are already affected.
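Here's a minimal sketch of why the weakest component matters. It assumes a request needs every service in a chain & that failures are independent. Both are simplifications, and the service names & availability numbers are made up for illustration.

```python
from math import prod


def chain_availability(availabilities):
    """Availability of a request path that needs every service to be up.

    Assumes independent failures, which real systems rarely guarantee.
    """
    return prod(availabilities)


# Hypothetical numbers: three healthy services and one weak one.
services = {"api": 0.9999, "auth": 0.9999, "catalog": 0.9999, "billing": 0.99}

overall = chain_availability(services.values())
weakest = min(services, key=services.get)

print(f"overall availability: {overall:.4f}")
print(f"weakest component: {weakest}")
```

Under these assumptions the overall number can never be higher than the weakest service's availability.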


Chaos Engineering

They wanted to manage the chaos present in distributed systems.

So they created chaos engineering.

Here’s how:

1. Implementation

A system with 0 downtime doesn’t exist.

But downtime can be minimized via automation.

So they proactively look for potential failures during office hours, when engineers are around to respond, & then automate the fix. This means a failure gets fixed quickly via automation if it occurs again. Observing a distributed system's behavior in a controlled experiment is called chaos engineering.

It helps to find a problem before it causes a production outage.

Chaos Monkey Turning off Traffic to the Main Database

They created a tool to randomly shut down servers and called it Chaos Monkey:

  • It gets information about available servers via the continuous delivery platform

  • It shuts down servers through the same continuous delivery platform

Then they check if traffic gets routed to another server without affecting users.
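The idea can be sketched as a toy simulation. This is not Chaos Monkey's actual code: the `Fleet` class, its `route` method & the `chaos_monkey` function are hypothetical stand-ins, and the real tool talks to a continuous delivery platform instead of an in-memory dictionary.

```python
import random


class Fleet:
    """Toy stand-in for a server group behind a load balancer."""

    def __init__(self, names):
        self.healthy = {name: True for name in names}

    def terminate(self, name):
        self.healthy[name] = False

    def route(self, request_id):
        """Send a request to any healthy server; raise if none are left."""
        candidates = [n for n, up in self.healthy.items() if up]
        if not candidates:
            raise RuntimeError("no healthy servers left")
        return candidates[request_id % len(candidates)]


def chaos_monkey(fleet, rng):
    """Pick one healthy server at random and shut it down."""
    victim = rng.choice([n for n, up in fleet.healthy.items() if up])
    fleet.terminate(victim)
    return victim


fleet = Fleet(["server-a", "server-b", "server-c"])
victim = chaos_monkey(fleet, random.Random(7))

# Check that traffic still gets served, and never by the terminated server.
served_by = {fleet.route(i) for i in range(100)}
print(f"terminated {victim}; traffic served by {sorted(served_by)}")
```

If `route` raised an error here, the test would have found a single point of failure before users did.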

A Systematic Approach to Chaos Engineering

Here’s how they do chaos engineering:

  • Create a hypothesis about how the system will behave during a failure

  • Run a small test to introduce failure - switch off the server or change the network configuration

  • Observe the system’s behavior & measure the failure impact

  • Automate the fix for the problem

Then rerun the test to check if the automated fix works as expected.
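The loop above can be sketched in a few lines. Everything here is an assumption for illustration: `ToyService`, its `auto_failover` switch & the 99% success threshold are hypothetical, and a real experiment would measure live traffic.

```python
class ToyService:
    """Hypothetical service with a primary and a replica database."""

    def __init__(self):
        self.primary_up = True
        self.auto_failover = False  # the "fix" we automate later

    def read(self):
        if self.primary_up:
            return "data from primary"
        if self.auto_failover:
            return "data from replica"
        raise ConnectionError("primary database is down")


def run_experiment(service, requests=20):
    """Inject one failure, then measure the fraction of successful reads."""
    service.primary_up = False  # introduce the failure
    ok = 0
    for _ in range(requests):
        try:
            service.read()
            ok += 1
        except ConnectionError:
            pass
    service.primary_up = True  # always restore the system
    return ok / requests


# Hypothesis: reads keep working if the primary database dies.
HYPOTHESIS_MIN_SUCCESS = 0.99

service = ToyService()
first_run = run_experiment(service)
if first_run < HYPOTHESIS_MIN_SUCCESS:
    service.auto_failover = True  # automate the fix
second_run = run_experiment(service)  # rerun to verify the fix

print(f"before fix: {first_run:.0%}, after fix: {second_run:.0%}")
```

The first run disproves the hypothesis, and the rerun confirms the automated fix holds.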

They run chaos experiments on production traffic for accuracy.

Yet the blast radius[2] must be controlled to avoid affecting users. Here’s how:

  • Prepare for the worst case with a backup plan

  • Use feature flags to roll back changes quickly if things go wrong

  • Run tests in pre-production before production

  • Run only small tests first & then scale

  • Introduce one chaos variable at a time - don’t break everything together

Besides, they measure the test impact carefully to prevent unnecessary damage.
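Here's a minimal sketch of "small tests first & then scale" with a quick rollback. The traffic shares, the 2% error limit & the `fake_monitoring` function are made-up assumptions. In practice the error rate would come from monitoring, and the abort would flip a feature flag off.

```python
def ramp_experiment(error_rate_at, steps=(0.01, 0.05, 0.25), max_error_rate=0.02):
    """Grow the share of traffic in the experiment step by step.

    error_rate_at: callable returning the observed error rate at a given
    traffic share (hypothetical; in practice this comes from monitoring).
    Returns (completed_steps, aborted).
    """
    completed = []
    for share in steps:
        observed = error_rate_at(share)
        if observed > max_error_rate:
            # Abort and roll back: the equivalent of flipping a feature flag off.
            return completed, True
        completed.append(share)
    return completed, False


# Made-up behaviour: the system copes with small tests but degrades at 25%.
def fake_monitoring(share):
    return 0.001 if share < 0.10 else 0.08


completed, aborted = ramp_experiment(fake_monitoring)
print(f"completed steps: {completed}, aborted: {aborted}")
```

The experiment stops at the first step that crosses the limit, so the damage stays within that step's traffic share.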

2. Principles

Principles of Chaos Engineering

They created chaos engineering around these principles:

  • Automate tests to save cost & time

  • Run tests in production for the same traffic pattern & reliable results

  • Run tests with events based on potential impact & frequency - server crash, wrong API response, traffic spike

  • Focus on measurable output to check if the system works - throughput, latency

  • Control & minimize blast radius

It gave them more confidence to experiment & get better results.
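"Measurable output" can be sketched as a comparison between a control group & an experiment group. This is only an illustration: the 10% tolerance, the helper names & the latency samples are assumptions, and throughput could be compared the same way.

```python
from statistics import quantiles


def p99(latencies_ms):
    """99th percentile latency from a list of samples."""
    return quantiles(latencies_ms, n=100)[98]


def within_steady_state(control_ms, experiment_ms, tolerance=1.10):
    """True if experiment p99 latency stays within 10% of the control group."""
    return p99(experiment_ms) <= p99(control_ms) * tolerance


# Made-up latency samples in milliseconds.
control = list(range(100, 200))
slightly_slower = [ms + 5 for ms in control]
degraded = [ms * 2 for ms in control]

print(within_steady_state(control, slightly_slower))
print(within_steady_state(control, degraded))
```

A small slowdown stays inside the tolerance, while doubled latency counts as a failed experiment.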

3. Use Cases

Chaos Engineering Use Cases

Here’s what they do with chaos engineering:

  • Reduce the number of failures

  • Improve system availability - for example, toward a 99.9% target

  • Find potential failures & automate the fix

  • Check if failover mechanisms work as expected

  • Find system bottlenecks & single points of failure

  • Check if data backup & restoration work as expected

  • Check how the system responds to service dependency failures

  • Do better capacity planning by studying the system’s response to various traffic

  • Check if the system recovers from failures quickly - mean time to resolution (MTTR)
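Two numbers from the list above are easy to work out: the downtime a 99.9% availability target allows & the MTTR. The sketch assumes a non-leap year, and the incident durations are made up.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability):
    """Allowed downtime per (non-leap) year for a given availability target."""
    return (1 - availability) * MINUTES_PER_YEAR


def mttr_minutes(incident_durations):
    """Mean time to resolution: average duration of the recorded incidents."""
    return sum(incident_durations) / len(incident_durations)


budget = downtime_budget_minutes(0.999)
mttr = mttr_minutes([12, 30, 18])  # made-up incident durations in minutes

print(f"99.9% availability allows about {budget:.0f} minutes of downtime per year")
print(f"MTTR: {mttr:.0f} minutes")
```

So 99.9% availability leaves roughly 8.76 hours of downtime per year, which is why a faster recovery matters.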

Besides, they created Chaos Monkey variants to handle more use cases.


They wrote Chaos Monkey in Go and open-sourced it.

And Netflix has only a few minutes of downtime per year.

It also remains the world’s largest streaming service.

This case study shows testing is important to build complex systems & move faster.



References

  • The Netflix Simian Army

  • Netflix Chaos Monkey Upgraded

  • Chaos Engineering Upgraded

  • Principles of Chaos Engineering

  • Chaos Monkey Documentation

  • AWS re:Invent 2020: Testing resiliency using chaos engineering

  • AWS re:Invent 2022: The evolution of chaos engineering at Netflix

  • Understanding Chaos Engineering

  • Kolton Andrus on Breaking Things at Netflix

  • Chaos Engineering: the history, principles, and practice

  • What is Chaos Engineering?

  • NAB deploys Chaos Monkey to kill servers 24/7

  • Eight Fallacies Of Distributed Computing

  • How to Get Started with Chaos Engineering

  • 4 Chaos Experiments to Start With

  • Microservices Lessons From Netflix

  • Mastering Chaos - A Netflix Guide to Microservices

  • Diagram tracking chaos tools & engineers

  • Images inspired by visualize value

[1] Resilience is the ability of a system to recover from failures.

[2] Blast radius is the area of a system affected during a test.

