Microservices Lessons From Netflix
#19: Check This Out - Awesome Microservices Guide (5 minutes)
Netflix runs on AWS. They started with a monolith and moved to microservices. Their reasons for migrating to microservices were the following:
It was difficult to find bugs with many changes to a single codebase
It became difficult to scale vertically
There were many single points of failure
This post outlines the microservices lessons from Netflix.
Microservices Challenges and Solutions
The 3 main problems with microservices architecture are dependency, scale, and variance.
Here are 4 scenarios where the dependency problem occurs:
i) Intra-Service Requests
A client request results in a service calling another service. Put another way, service A needs to call service B to create a response.
The problem is that the failure of one microservice can cause cascading failures. The solutions are:
Use the circuit breaker pattern to prevent cascading failures. It skips an operation that will probably fail
Do fault injection testing, which creates artificial traffic, to check whether the circuit breaker works as expected
Set up fallback to a static page to keep the system always responsive
Implement exponential backoff to prevent the thundering herd problem
Yet degraded availability and increased system test scope remain a problem: overall availability drops when the downtimes of individual microservices combine, and the permutations of the test scope grow rapidly as the number of microservices increases.
So it’s important to identify critical services and create a bypass path to avoid non-critical service failures.
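The circuit breaker and backoff ideas above can be sketched in a few lines of Python. This is a minimal illustration, not Netflix's actual implementation (their well-known real-world example is the Hystrix library); the class name, thresholds, and timeouts below are arbitrary assumptions:

```python
import random
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: trips open after repeated failures,
    then skips calls that would probably fail and serves a fallback."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        # While open, short-circuit to the fallback until the reset timeout passes.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = operation()
            self.failures = 0  # success closes the circuit again
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()

def backoff_delays(base=0.1, cap=10.0, attempts=5):
    """Exponential backoff with jitter: every client retries at a slightly
    different time, which avoids the thundering herd problem."""
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]
```

The fallback here plays the role of the static page mentioned above: the caller always gets a response, even while the dependency is down.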
ii) Client Libraries
An API gateway allows the reuse of business logic across many types of clients.
An API gateway is a central entry point that routes API requests. But it has the following limitations:
Heap consumption becomes difficult to manage
Potential logical defects or bugs
Potential transitive dependencies
So the solution is to keep the API gateway simple and prevent it from becoming the new monolith.
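A gateway that stays simple is little more than a route table. Here is a minimal sketch of that idea; the paths and service names are hypothetical:

```python
# Hypothetical route table: path prefix -> backend service name.
ROUTES = {
    "/users": "user-service",
    "/videos": "video-service",
    "/billing": "billing-service",
}

def route(path):
    """Pick the backend for a request path. The gateway only routes;
    business logic stays inside the services themselves."""
    for prefix, service in ROUTES.items():
        if path.startswith(prefix):
            return service
    return "not-found"
```

Anything beyond routing, authentication, and rate limiting is a candidate for pushing back into the services, which is how the gateway avoids becoming a monolith.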
iii) Data Persistence
The choice of the storage layer depends on the CAP theorem. Put another way, it is a trade-off decision between availability and consistency.
So the solution is to study the data access patterns and choose the right storage.
iv) Infrastructure
An entire data center might fail. So the solution is to replicate the infrastructure across many data centers.
The ability of a system to handle an increased workload while maintaining performance is called scale. The 3 ways to achieve horizontal scalability are:
Keep the service stateless if possible
Partition the service if it can't be stateless
Replicate the service
Here are 3 scenarios where the scale problem occurs:
i) Stateless Services
The 2 qualities of a stateless service are:
There is no instance affinity (sticky sessions). Put another way, requests don't get routed to the same server
Failure of a stateless service is not a notable event
The stateless service needs to be replicated for high availability. And autoscaling must be set up for on-demand replication.
Autoscaling also reduces the impact of problems such as reduced compute efficiency.
The database and cache are not stateless services.
Chaos engineering checks whether autoscaling works as expected. It tests system resilience through controlled disruptions to ensure improved reliability.
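The on-demand replication above can be sketched as a proportional scaling rule, similar in spirit to the formula Kubernetes' Horizontal Pod Autoscaler uses; the target utilization and replica bounds below are illustrative assumptions, not Netflix's values:

```python
import math

def desired_replicas(current, cpu_utilization, target=0.5, min_r=2, max_r=100):
    """Proportional autoscaling sketch: scale the replica count by the
    ratio of observed to target utilization, clamped to sane bounds."""
    wanted = math.ceil(current * cpu_utilization / target)
    return max(min_r, min(max_r, wanted))
```

Running at least `min_r` replicas keeps the stateless service highly available even when load is low, since a single instance would be a single point of failure.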
ii) Stateful Services
The database and cache are stateful services. Also a custom service that holds large amounts of data is a stateful service. The failure of a stateful service is a notable event.
An anti-pattern with stateful services is using sticky sessions without replication, because it creates a single point of failure.
The solution is to replicate the writes across many servers in different data centers. And route the reads to the local data center.
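A toy in-memory sketch of that pattern: writes fan out to every region, while reads are served from the caller's local region. Real systems replicate asynchronously over the network; this synchronous dictionary version only illustrates the routing, and the region names are made up:

```python
class ReplicatedStore:
    """Toy multi-region store: writes are replicated to every region,
    reads are routed to the caller's local region."""

    def __init__(self, regions):
        self.replicas = {region: {} for region in regions}

    def write(self, key, value):
        # Replicate the write across all data centers.
        for store in self.replicas.values():
            store[key] = value

    def read(self, key, local_region):
        # Route the read to the local data center for low latency.
        return self.replicas[local_region].get(key)
```

Because every region holds a full copy, losing one data center loses no data and reads keep working everywhere else.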
iii) Hybrid Services
A cache is a hybrid service. A hybrid service must handle extreme load. For example, Netflix’s cache gets 30 million requests per second.
The best approach to building a hybrid service is the following:
Partition the workload using techniques like consistent hashing
Enable request-level caching
Allow fallback to a database
And use Chaos engineering to check whether the hybrid service remains functional under extreme workloads.
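The partitioning step above can be sketched with a consistent hash ring using virtual nodes, so that adding or removing a cache server remaps only a small fraction of the keys; the node names and vnode count are illustrative:

```python
import bisect
import hashlib

def _hash(key):
    # Stable hash so keys map to the same ring position on every host.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Consistent hashing sketch: each server owns many virtual nodes
    on a ring, and a key belongs to the next virtual node clockwise."""

    def __init__(self, nodes, vnodes=100):
        self.ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    def node_for(self, key):
        # First virtual node at or after the key's hash, wrapping around.
        idx = bisect.bisect(self.keys, _hash(key)) % len(self.ring)
        return self.ring[idx][1]
```

On a cache miss for the chosen node, the request falls back to the database, as the list above describes.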
The variety in the software architecture is called variance. The system complexity grows as variance increases.
Here are 2 scenarios where the variance problem occurs:
i) Operational Drift
Operational drift is the unintentional variance that accumulates as time passes. It is usually a side effect of new features added to the system. An example of operational drift is increased alert thresholds.
The solution to this is continuous learning and automation.
Here is the workflow:
Review an incident resolution
Put remediation in place to prevent it from occurring again
Analyze many incidents to identify patterns
Derive best practices from incident resolutions
Automate the best practices if possible
Promote the adoption of best practices
ii) Polyglot
The variance introduced by engineers on purpose is called polyglot. It happens when different programming languages are used to create different microservices.
It comes with the following drawbacks:
A large amount of work needed to get productive tooling
Extra operational complexity
Difficulty in server management
Business logic duplication across many technologies
Increased learning curve to become an expert
The solution to this problem is to use proven technologies.
Besides, polyglot forces the decomposition of the API gateway, which is a good thing. So the best ways to use a polyglot architecture are:
Raise team awareness of the costs of each technology
Limit centralized support to critical services
Prioritize based on the impact
Create reusable solutions by offering a pool of proven technologies
Here is a checklist of best practices for microservices architecture from Netflix:
Autoscale to handle dynamic load
Chaos engineering for improved reliability
Consistent naming conventions
Health check services
Blue-green deployment to roll back quickly
Configure timeouts, retries, and fallbacks
Also, change is inevitable, and things always break with changes. So the best approach is to move faster but with fewer breaking changes.
Besides, it's helpful to restructure teams to support the software architecture.