The System Design Newsletter

How Halo Scaled to 11.6 Million Users Using the Saga Design Pattern 🎮

#51: Break Into Saga Design Pattern (4 Minutes)

Neo Kim
Jul 04, 2024

This post outlines the Saga design pattern. You will find references at the bottom of this page if you want to go deeper.

The Saga design pattern is often used to achieve data consistency and reliability in distributed systems. This post explains it through the story of Halo, a game series that Microsoft now owns.

Once upon a time, a game development company named Bungie made a strategy game.

Yet they didn’t have success with it.

So they pivoted to create a shooting game and called it Halo.

They used a single SQL database to store the entire game data.

But their growth rate was incredible.

And it became difficult to store all data in a single database.

So they set up a NoSQL database and partitioned it.

Each game can have up to 32 players.

And each player’s data gets stored in a different database partition.

Storing Data Across Database Partitions
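
Here is a minimal sketch of how player data might be routed to partitions. The post doesn't say how Halo picked a partition, so the hash-based routing, the partition count, and the function names below are assumptions for illustration only.

```python
import hashlib

NUM_PARTITIONS = 8  # hypothetical partition count, not from the post


def partition_for(player_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a player ID to a partition index using a stable hash (assumed scheme)."""
    digest = hashlib.sha256(player_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def group_by_partition(player_ids):
    """Group the players of one game by the partition that holds their data."""
    groups = {}
    for player_id in player_ids:
        groups.setdefault(partition_for(player_id), []).append(player_id)
    return groups
```

With this kind of routing, a single 32-player game touches many partitions at once. So one game update turns into writes against several partitions.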

Although it temporarily solved their scalability issue, it created new problems.

Here are some of them:

1. Atomicity:

Each player in the same game must see the correct game points. That means atomic writes.

Atomicity is the idea that either the writes to every partition succeed or no writes happen at all.

Yet there’s a risk of database partition failure.

This means a failed partition will have wrong data due to missing writes.

So it’s hard to achieve atomicity with a partitioned database.
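
The problem can be sketched with a toy simulation. The `Partition` class, the `PartitionDown` error, and `naive_write_all` are hypothetical names made up for this example, not part of any real database client.

```python
class PartitionDown(Exception):
    """Raised when a simulated partition cannot accept a write."""


class Partition:
    """In-memory stand-in for one database partition (hypothetical)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.points = {}  # player_id -> game points

    def write(self, player_id, points):
        if not self.healthy:
            raise PartitionDown(self.name)
        self.points[player_id] = points


def naive_write_all(updates):
    """Write to each partition in turn, with no rollback on failure.

    `updates` is a list of (partition, player_id, points) tuples.
    Returns the player IDs whose writes went through.
    """
    written = []
    for partition, player_id, points in updates:
        try:
            partition.write(player_id, points)
            written.append(player_id)
        except PartitionDown:
            break  # earlier writes stay in place: the update is not atomic
    return written
```

If the second of three partitions is down, the first player's points get saved while the other two players' points don't. That is the partial state atomicity is supposed to rule out.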

2. Consistency:

Consistency means changing data from one valid state to another valid state.

Yet there’s a risk of network latency and network failures.

That means some partitions will contain outdated data.

So it’s hard to achieve consistency with a partitioned database.


Saga Design Pattern

They wanted a simple & scalable failure management pattern.

So they set up Saga.

Here’s how Saga works:

1. Divide & Conquer:

It splits a transaction into sub-transactions.

And assigns a separate sub-transaction to each database partition.

To the caller, it still looks like a single transaction.

Saga Applying Compensating Transactions after a Failed Sub-Transaction

Besides, each sub-transaction has a revert action called a compensating transaction. It’s a correction and not an undo. For example, marking a hotel booking as canceled instead of deleting the booking record.

  • The compensating transactions get executed only if a sub-transaction fails.

  • The compensating transactions must be idempotent, so retries are possible without side effects.
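
Here is one way an idempotent compensating transaction might look. It's a sketch under assumptions: the `PlayerPartition` class, the `saga_id` key, and the in-memory sets are invented for this example, and the post doesn't describe Halo's actual mechanism.

```python
class PlayerPartition:
    """In-memory stand-in for one database partition (hypothetical)."""

    def __init__(self):
        self.points = {}          # player_id -> current points
        self.applied = set()      # (saga_id, player_id) pairs already applied
        self.compensated = set()  # (saga_id, player_id) pairs already compensated

    def add_points(self, saga_id, player_id, amount):
        """Sub-transaction: add the points at most once per saga."""
        key = (saga_id, player_id)
        if key in self.applied:
            return  # retry of a write that already happened
        self.points[player_id] = self.points.get(player_id, 0) + amount
        self.applied.add(key)

    def compensate_points(self, saga_id, player_id, amount):
        """Compensating transaction: a correcting write that is safe to retry."""
        key = (saga_id, player_id)
        if key not in self.applied or key in self.compensated:
            return  # nothing to correct, or already corrected
        self.points[player_id] -= amount
        self.compensated.add(key)
```

Calling `compensate_points` twice for the same saga changes the points only once. So a retry after a timeout doesn't subtract the points a second time.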

2. Interacting with Database Partitions:

They set up a separate service to manage sub-transactions and called it Orchestrator.

Saga Orchestrator Controlling Sub-Transactions

It let them:

  • Control sub-transactions

  • Apply compensating transactions if needed
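
The two duties above can be sketched in a few lines. This is a simplified, in-process illustration: `run_saga` and the `(action, compensation)` pair format are assumptions, and a real orchestrator would call the database partitions over the network.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action raises, run the compensations of the steps that already
    completed, newest first, and report failure.
    """
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            for compensate in reversed(completed):
                compensate()
            return False
        completed.append(compensation)
    return True
```

The compensations run in reverse order here, which is a common choice and not something the post specifies. No compensation runs when every sub-transaction succeeds.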

3. State Information:

It’s necessary to store each sub-transaction's state (start & end) outside the Orchestrator.

Otherwise the state is lost when the Orchestrator crashes, and the Orchestrator becomes a single point of failure.

So they use a durable and distributed log.

Storing State of Sub-Transactions in Log

It let them:

  • Track if a sub-transaction failed

  • Find compensating transactions that must be executed

  • Track the state of compensating transactions

  • Keep the orchestrator stateless

  • Recover from failures
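
Here is a rough sketch of how a restarted orchestrator might read the log to decide what to compensate. The event names ("start", "end", "failed", "compensated") and the list-of-tuples log format are assumptions. The post only says the log stores the start and end of each sub-transaction.

```python
def pending_compensations(log):
    """Return the sub-transactions that still need compensation, newest first.

    `log` is an ordered list of (event, name) tuples for one saga, where
    event is "start", "end", "failed", or "compensated" (hypothetical names).
    "end" entries are recorded in the log but not needed for this decision.
    """
    started = []
    compensated = set()
    saga_failed = False
    for event, name in log:
        if event == "start":
            started.append(name)
        elif event == "failed":
            saga_failed = True
        elif event == "compensated":
            compensated.add(name)
    if not saga_failed:
        return []
    # Every started sub-transaction is compensated, including the failed one,
    # which is why compensations need to be idempotent.
    return [name for name in reversed(started) if name not in compensated]
```

Because the answer comes from the log alone, any orchestrator instance can pick up the work after a crash. That is what keeps the orchestrator stateless.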
