How Netflix Uses Chaos Engineering to Create Resilient Systems 🐒
#53: Break Into Netflix Engineering (4 minutes)
Chaos engineering is used to build resilient¹ distributed systems. You will find references at the bottom of this page if you want to go deeper.
Once upon a time, Netflix offered DVD rentals via mail.
Yet their growth rate was limited.
So they pivoted to create a streaming service.
And set up a monolith tech stack.
Although explosive growth is a good problem, scaling their infrastructure became difficult.
So they set up microservices.
But microservices created new problems.
Here are some of them:
1. Unreliable Network
Computer networks suffer from:
Latency issues
Network failures
Bandwidth limitations
So there’s a risk of communication failure between services.
2. Resilience
The resilience of a distributed system depends on its weakest component.
And the weakest component is usually found only after a failure, which means users are already affected.
Chaos Engineering
They wanted to manage the chaos present in distributed systems.
So they created chaos engineering.
Here’s how:
1. Implementation
A system with 0 downtime doesn’t exist.
But downtime can be minimized via automation.
So they proactively find potential failures during office hours & then automate the fix. This means a failure gets fixed quickly via automation if it occurs again. Observing a distributed system's behavior in a controlled experiment is called chaos engineering.
It helps to find a problem before it causes a production outage.
They created a tool to randomly shut down servers and called it Chaos Monkey:
It gets information about available servers via the continuous delivery platform
It shuts down servers through the same continuous delivery platform
Then they check if traffic gets routed to another server without affecting users.
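Here's a minimal sketch of the idea in Python. It is not Netflix's code. The DeliveryPlatform interface, the InMemoryPlatform stand-in, the unleash_monkey function & the min_healthy safety floor are all assumptions made up for illustration:

```python
import random
from typing import Optional, Protocol


class DeliveryPlatform(Protocol):
    """Hypothetical interface to a continuous delivery platform."""

    def list_servers(self, group: str) -> list[str]: ...

    def terminate(self, server_id: str) -> None: ...


class InMemoryPlatform:
    """Stand-in platform so the sketch runs without real infrastructure."""

    def __init__(self, groups: dict[str, list[str]]) -> None:
        self.groups = {name: list(servers) for name, servers in groups.items()}

    def list_servers(self, group: str) -> list[str]:
        return list(self.groups[group])

    def terminate(self, server_id: str) -> None:
        for servers in self.groups.values():
            if server_id in servers:
                servers.remove(server_id)


def unleash_monkey(
    platform: DeliveryPlatform,
    group: str,
    rng: random.Random,
    min_healthy: int = 2,
) -> Optional[str]:
    """Terminate one random server, unless that leaves too few running."""
    servers = platform.list_servers(group)
    if len(servers) <= min_healthy:
        return None  # skip: a termination would drop below the safety floor
    victim = rng.choice(servers)
    platform.terminate(victim)
    return victim
```

Calling unleash_monkey(platform, "api", random.Random()) on a group with three servers removes one of them. A second call returns None because only two servers remain.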
Here’s how they do chaos engineering:
Create a hypothesis about how the system will behave during a failure
Run a small test to introduce failure - switch off the server or change the network configuration
Observe the system’s behavior & measure the failure impact
Automate the fix for the problem
Then rerun the test to check if the automated fix works as expected.
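The steps above can be sketched as a small experiment loop. This is only an illustration under assumptions: the Experiment fields, the ToyService class & the 0.01 error-rate tolerance are hypothetical, and a real setup would read metrics from a monitoring system:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    hypothesis: str
    inject_failure: Callable[[], None]
    measure_error_rate: Callable[[], float]
    rollback: Callable[[], None]
    max_error_increase: float  # tolerated rise over the baseline


def run_experiment(exp: Experiment) -> bool:
    """Return True if the hypothesis held during the injected failure."""
    baseline = exp.measure_error_rate()
    exp.inject_failure()
    try:
        observed = exp.measure_error_rate()
    finally:
        exp.rollback()  # always restore the system, even if measuring fails
    return observed - baseline <= exp.max_error_increase


class ToyService:
    """Toy system: requests fail only if the primary is down with no failover."""

    def __init__(self) -> None:
        self.primary_up = True
        self.failover_enabled = False  # stands in for the automated fix

    def error_rate(self) -> float:
        if self.primary_up or self.failover_enabled:
            return 0.0
        return 1.0


def server_crash_experiment(service: ToyService) -> Experiment:
    return Experiment(
        hypothesis="Losing the primary server does not raise the error rate",
        inject_failure=lambda: setattr(service, "primary_up", False),
        measure_error_rate=service.error_rate,
        rollback=lambda: setattr(service, "primary_up", True),
        max_error_increase=0.01,
    )
```

With failover_enabled left at False, run_experiment returns False, so the hypothesis is rejected. Set failover_enabled to True (the automated fix in this toy) & the rerun returns True.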
They run chaos experiments on production traffic for accuracy.
Yet blast radius² must be controlled to avoid affecting users. Here’s how:
Prepare for the worst case with a backup plan
Use feature flags to roll back changes quickly if things go wrong
Run tests in pre-production before production
Run only small tests first & then scale
Introduce one chaos variable at a time - don’t break everything together
Besides, they measure test impact carefully to prevent unnecessary damage.
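One way to picture "small tests first & then scale" is a ramp that widens the experiment step by step and stops at the first sign of trouble. A rough sketch, assuming a hypothetical error_rate_at function that reports the error rate seen at a given traffic percentage:

```python
from typing import Callable, Sequence


def ramp_up(
    steps: Sequence[float],
    error_rate_at: Callable[[float], float],
    abort_threshold: float,
) -> tuple[float, bool]:
    """Widen the experiment one step at a time.

    Returns (largest traffic percentage that stayed healthy, aborted?).
    """
    safe = 0.0
    for pct in steps:
        if error_rate_at(pct) > abort_threshold:
            return safe, True  # stop here; this is where a rollback would fire
        safe = pct
    return safe, False
```

For example, with steps of 1, 5, 25 & 100 percent, if errors stay low up to 5% but spike at 25%, the ramp stops and reports 5% as the last safe step.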
2. Principles
They created chaos engineering around these principles:
Automate tests to save cost & time
Run tests in production to get real traffic patterns & reliable results
Prioritize test events by potential impact & frequency - server crash, wrong API response, traffic spike
Focus on measurable output to check if the system works - throughput, latency
Control & minimize blast radius
It gave them more confidence to experiment & get better results.
3. Use Cases
Here’s what they do with chaos engineering:
Reduce the number of failures
Improve system availability - for example, toward a 99.9% target
Find potential failures & automate the fix
Check if failover mechanisms work as expected
Find system bottlenecks & single points of failure
Check if data backup & restoration work as expected
Check how the system responds to service dependency failures
Do better capacity planning by studying the system’s response to various traffic
Check if the system recovers from failures quickly - mean time to resolution (MTTR)
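Some quick arithmetic shows what an availability number means in downtime. The sketch below uses the standard formula availability = MTBF / (MTBF + MTTR). Mean time between failures (MTBF) is not discussed in this post, so treat it and the function names as assumptions added for illustration:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability in percent from mean time between failures and MTTR."""
    return 100 * mtbf_hours / (mtbf_hours + mttr_hours)
```

A 99.9% target allows about 525.6 minutes of downtime per year, which is roughly 8.76 hours. And for a fixed MTBF, a shorter MTTR gives higher availability, which is why faster recovery matters.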
Besides, they created Chaos Monkey variants to handle more use cases.
They wrote Chaos Monkey in Go and open-sourced it.
And Netflix has only a few minutes of downtime per year.
It also remains the world’s largest streaming service.
This case study shows testing is important to build complex systems & move faster.
References
AWS re:Invent 2020: Testing resiliency using chaos engineering
AWS re:Invent 2022: The evolution of chaos engineering at Netflix
¹ Resilience is the ability of a system to recover from failures.
² Blast radius is the area of a system affected during a test.