How 0.1% Companies Do Hyperscaling

#34: And Cell Based Architecture Explained Like You’re Twenty (7 minutes)

Jan 28, 2024


This post outlines how cell based architecture works. If you want to learn more, scroll to the bottom and find the references.


Once upon a time, there lived a software architect named John.

He worked for a startup that went through a mind-boggling growth rate.

Although explosive growth is a good problem to have, the scalability issues that came with it seemed impossible to solve.

Hyperscale

Hypergrowth is commonly defined as a compound annual growth rate above 40%. Yet many companies in hypergrowth fail because of scalability problems.

So they need to hyperscale.

Hyperscale is an architecture's ability to scale quickly as demand changes. It needs extreme parallelization and fault isolation.

So he was frustrated.

Titanic's Designer Talking About Its Resilience

Until one weekend he watched the film Titanic on television.

The scene in which the ship’s designer talks about its resilience caught his attention.

Vertical Partition Walls Dividing the Ship’s Interior Into Watertight Compartments

They partitioned the ship's interior using vertical walls. Thus the compartments became watertight and self-contained. And the ship wouldn't sink unless many compartments were flooded.

Put another way, the walls prevent seawater flooding from affecting the entire ship.

He had a eureka moment. And he wrote down notes with a pencil before going to bed.

His notes from that day laid the foundation of cell based architecture.


Cell Based Architecture

Here’s how cell based architecture works:

1. Implementation

A system is divided into cells and the traffic gets routed between the cells using a cell router.

A cell might contain many services, load balancers, and databases. And it’s technology-agnostic.

Put another way, a cell is a completely self-contained instance of the application. So it’s independently deployable and observable.

The customer traffic gets routed to the right cells via a thin layer called the cell router.

The failure of a cell doesn't affect another one because they are separated at the logical level. In other words, cell based architecture keeps any single cell from becoming a single point of failure for the whole system.

A cell could be created using an infrastructure-as-code script or any programming language. And each cell gets a name and a version identifier.

The components within a cell communicate with each other through supported network mechanisms. And external communication happens over standard network protocols via a gateway.

Reducing Scope of Impact With Cell Based Architecture

A cell shouldn’t share its state with other cells and should handle only a subset of the total traffic. Thus the impact of a failure, like a bad code deployment, is reduced.

If a cell fails, only the customers in that specific cell are affected. So the blast radius is smaller.

Put another way, if a system contains 10 cells and serves 100 requests spread evenly across them, the failure of a single cell affects only 10% of the requests.

Blast radius is the approximate number of customers affected by a failure.
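The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `affected_requests` is made up for this example, and it assumes traffic is spread evenly across cells, which real systems only approximate.

```python
def affected_requests(total_requests: int, cell_count: int, failed_cells: int = 1) -> float:
    """Estimate how many requests are hit when `failed_cells` cells fail.

    Assumption: requests are spread evenly across all cells.
    """
    if cell_count <= 0:
        raise ValueError("cell_count must be positive")
    if not 0 <= failed_cells <= cell_count:
        raise ValueError("failed_cells must be between 0 and cell_count")
    return total_requests * failed_cells / cell_count


# 10 cells, 100 requests, 1 failed cell
print(affected_requests(100, 10))  # 10.0
```

With a single cell, the same failure would hit all 100 requests.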

2. Key Concepts

Here are some key concepts of cell based architecture:

a) Customer Placement

Customers get partitioned across cells using a partition key. A simple or composite partition key can be used to distribute the traffic between cells.

And the Customer ID is a candidate partition key for most use cases.

The cell router forwards requests to cells based on the partition key. Consistent hashing or range-based mapping algorithms can be used to map customers to cells.
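Here's a minimal sketch of consistent hashing with the Customer ID as the partition key. The class name `ConsistentHashRing`, the cell names, and the choice of 100 virtual nodes per cell are assumptions for illustration, not something the architecture prescribes.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Use a stable hash. Python's built-in hash() is randomized per process.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ConsistentHashRing:
    """Maps customer IDs to cells using a hash ring with virtual nodes."""

    def __init__(self, cells, vnodes: int = 100):
        # Each cell is placed on the ring many times to even out the load.
        self._ring = sorted(
            (_hash(f"{cell}#{i}"), cell) for cell in cells for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def cell_for(self, customer_id: str) -> str:
        if not self._ring:
            raise LookupError("no cells registered")
        # Pick the first ring point clockwise from the customer's hash.
        idx = bisect.bisect_right(self._points, _hash(customer_id)) % len(self._ring)
        return self._ring[idx][1]


ring = ConsistentHashRing(["cell-1", "cell-2", "cell-3"])
print(ring.cell_for("customer-42"))
```

The property that matters here: when a new cell is added, a customer either stays in its current cell or moves to the new one. So only a fraction of customers need to be migrated.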

Provisioning Cells Based on the Number of Customers

The number of customers supported by a specific cell depends on its capacity. And a service can be scaled out by adding more cells.
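As a rough sketch, the number of cells to provision follows from the customer count and the per-cell capacity. The function name `cells_needed` and the numbers are hypothetical, and the sketch assumes every cell has the same capacity.

```python
def cells_needed(customer_count: int, customers_per_cell: int) -> int:
    """Return how many fixed-capacity cells are needed for the customers."""
    if customers_per_cell <= 0:
        raise ValueError("customers_per_cell must be positive")
    if customer_count < 0:
        raise ValueError("customer_count must not be negative")
    # Ceiling division: a partially filled cell still has to exist.
    return (customer_count + customers_per_cell - 1) // customers_per_cell


print(cells_needed(2500, 1000))  # 3
```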

b) Cell Router

A simple router can be implemented with DNS. And an API gateway combined with a NoSQL database like DynamoDB can be used for complex routing.

The routing layer must be kept simple and horizontally scalable to prevent failures.
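To show how thin the routing layer can be, here's a sketch of a router that only does two lookups. The plain dictionaries stand in for a key-value store such as DynamoDB, and the class name `CellRouter`, the customer IDs, and the endpoint URLs are all made up for this example.

```python
class CellRouter:
    """Resolves a customer to the endpoint of the cell that serves it."""

    def __init__(self, placements: dict, endpoints: dict):
        self._placements = placements  # customer ID -> cell name
        self._endpoints = endpoints    # cell name -> cell gateway endpoint

    def route(self, customer_id: str) -> str:
        try:
            cell = self._placements[customer_id]
        except KeyError:
            raise LookupError(f"no cell assigned to customer {customer_id!r}") from None
        return self._endpoints[cell]


router = CellRouter(
    placements={"customer-1": "cell-1", "customer-2": "cell-2"},
    endpoints={
        "cell-1": "https://cell-1.example.internal",
        "cell-2": "https://cell-2.example.internal",
    },
)
print(router.route("customer-2"))  # https://cell-2.example.internal
```

The router holds no business logic. It only maps a partition key to a cell, which keeps it easy to replicate and scale horizontally.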

A dedicated gateway can be installed on each cell for communication. And it becomes the single access point to the cell. Alternatively, a shared gateway can be set up via a centralized deployment.

The gateway provides a well-defined interface to a subset of APIs, events, or streams.

c) Cell Size

Each cell must have a fixed maximum size. Thus the risk of non-linear scaling factors and contention points gets reduced.

Also scaling out and stress testing are easier with fixed-size cells. So the mean time between failures (MTBF) is higher.

And the number of hosts that need to be touched for deployments and diagnosis is reduced. So the mean time to recovery (MTTR) is lower.

Yet the cell boundary depends on the business domain and organizational structure.

The blast radius is smaller when there are many small cells. Yet capacity is better used with a few large cells. So an optimal cell size must be chosen for best performance.

d) Cell Deployment

For safety, a deployment must be tested on a single cell before it's rolled out to the others.

Cell Deployment in Waves

So the cells get deployed in waves and the metrics are monitored. If a wave fails, the deployment is rolled back. Otherwise the next wave gets introduced.
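A wave-based rollout could be sketched like this. The function `deploy_in_waves` and its callbacks are hypothetical placeholders for real deployment and monitoring tooling. The sketch also assumes that a failed health check rolls back every cell deployed so far, which is one possible policy.

```python
from typing import Callable, List, Sequence


def deploy_in_waves(
    waves: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy wave by wave. Returns True if every wave stayed healthy."""
    deployed: List[str] = []
    for wave in waves:
        for cell in wave:
            deploy(cell)
            deployed.append(cell)
        # Check metrics for all cells touched so far before the next wave.
        if not all(is_healthy(cell) for cell in deployed):
            # Assumed policy: undo the deployment everywhere, newest first.
            for cell in reversed(deployed):
                rollback(cell)
            return False
    return True
```

The first wave would usually hold a single cell, and later waves can grow as confidence grows. For example, `[["cell-1"], ["cell-2", "cell-3"], ["cell-4", "cell-5", "cell-6"]]`.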

3. Use Cases

Some use cases where cell based architecture is a good fit are:

Applications that need high availability
High-scale systems that are too big to fail
Systems with low recovery time objective (RTO)
Systems with many combinations of test cases but insufficient coverage

The cell boundaries provide resilience against failures like buggy feature deployments. And they contain poison pill requests by limiting the scope of impact.

Resilience is the ability of the system to recover from a failure quickly.

A poison pill request occurs when a request triggers a failure mode across the system.

The cell based architecture makes the design more modular and reduces failover problems.

Besides, issues due to misbehaving clients, data corruption, and operational mistakes are kept from spreading across the system.

4. Best Practices

Here’s a list of best practices with cell based architecture:

Start with more than a single cell from day 1 to get familiar with the architecture
Consider the current tech stack as cell zero and add a router layer
Perform a failure mode analysis of the cell to find its resilience
A single team could own an entire cell for simplicity. But the cell boundary should align with the bounded context
Cells should communicate via versioned and well-defined APIs
Cells must be secured through policies in API gateways
Cells should throttle and scale independently
Dependencies between cells reduce the benefits of cell based architecture. So they should be kept to a minimum
There shouldn’t be shared resources like databases to avoid a global state
The cells should get deployed in waves
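To make "cells should throttle independently" concrete, here's a sketch that gives each cell its own token bucket. The token bucket is a swapped-in technique chosen for illustration (the post doesn't name one), and the class name `TokenBucket`, the rates, and the cell names are assumptions.

```python
import time


class TokenBucket:
    """Allows up to `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._tokens = capacity
        self._clock = clock
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill tokens for the elapsed time, capped at the bucket capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


# One bucket per cell: no throttle state is shared between cells.
throttles = {
    "cell-1": TokenBucket(rate=100, capacity=200),
    "cell-2": TokenBucket(rate=100, capacity=200),
}
print(throttles["cell-1"].allow())  # True
```

Because each cell owns its bucket, a traffic spike that exhausts one cell's budget doesn't reduce the budget of the others.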

5. Anti-Patterns

Here’s a list of common anti-patterns with cell based architecture:

Growing the cell size without limits
Deploying code to every cell at once
Sharing state between cells
Adding complex logic to the routing layer
Increasing interactions between cells

6. Cell Based Architecture vs Microservices

A cell could represent a bounded context. And it can be implemented as a monolith, a set of microservices, or serverless functions.

Bounded Context in Domain-Driven Design (DDD) is the explicit domain model boundary.

Cell Based Architecture vs Microservices

Cell based architecture applies the fundamental concepts of microservices across the entire application. For example, microservices allow the creation of small and separate services. If a service is down, the system will remain operational.

In contrast, cell based architecture creates a copy of the entire production environment or the bounded context. So only a few users get affected by a cell failure.

A service failure is inevitable. But its impact can be controlled.

Companies like Slack, DoorDash, and Amazon run cell based architecture.

Yet it isn’t an alternative to microservices architecture. Instead, it's a way to approach microservices with more agility and confidence.

Also it increases the costs and complexity of the architecture due to infrastructure redundancy. So it’s important to understand the trade-offs and use it only if necessary.


Author NK; System design case studies — **Follow me on LinkedIn | YouTube | Threads | Twitter | Instagram**


References
