This post outlines the internal architecture of AWS S3. You will find references at the bottom of this page if you want to go deeper.
Note: This post is based on my research and may differ from real-world implementation.
Once upon a time, there was an analytics startup.
They collected data from customer websites and stored it in log files.
But they had only a few customers.
So a tiny storage server was enough.
But one morning they got a new customer with an extremely popular website.
And the number of log files started to skyrocket.
But their storage server had limited capacity.
So they bought new hardware.
Although it temporarily solved their storage issues, new problems appeared.
Here are some of them:
1. Scalability
The storage server might become a capacity bottleneck over time.
And installing and maintaining a larger storage server is expensive.
2. Performance
The storage server must be optimized for performance.
But they didn't have the time or expertise for that.
Onward.
They wanted to ditch the storage management problem.
And focus only on product development.
So they moved to Amazon Simple Storage Service (S3) - an object storage service.
It stores unstructured data without hierarchy.
And it handles over 100 million requests per second at peak.
Yet having performance at scale is a hard problem.
So smart engineers at Amazon used simple ideas to solve it.
S3 Architecture
Here’s how S3 works:
1. Scalability
They provide a REST API via the web server.
While metadata & file content are stored separately - this lets each part scale independently.
They store the metadata of uploaded data objects in a key-value database. And cache it for high availability.
Each component in the above diagram consists of many microservices. And these services interact with each other via API contracts.
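The metadata/content split can be sketched as two independent stores behind one facade. All names below are invented for illustration - this is not S3's real code, just the shape of the idea:

```python
# Sketch of the metadata/content split: object metadata lives in a
# key-value store (with a cache), raw bytes live in a separate blob
# store, so each layer can scale independently.

class MetadataStore:
    """Stands in for S3's key-value metadata database."""
    def __init__(self):
        self.table = {}   # key -> metadata record
        self.cache = {}   # hot entries cached for availability

    def put(self, key, record):
        self.table[key] = record
        self.cache[key] = record

    def get(self, key):
        return self.cache.get(key) or self.table.get(key)

class BlobStore:
    """Stands in for the storage fleet holding raw bytes."""
    def __init__(self):
        self.blobs = {}

    def write(self, blob_id, data):
        self.blobs[blob_id] = data

    def read(self, blob_id):
        return self.blobs[blob_id]

class ObjectStore:
    """REST-style facade: PUT writes bytes + metadata, GET joins them."""
    def __init__(self):
        self.meta = MetadataStore()
        self.data = BlobStore()
        self._next_id = 0

    def put_object(self, key, body):
        blob_id = self._next_id
        self._next_id += 1
        self.data.write(blob_id, body)
        self.meta.put(key, {"blob_id": blob_id, "size": len(body)})

    def get_object(self, key):
        record = self.meta.get(key)
        return self.data.read(record["blob_id"])

s3 = ObjectStore()
s3.put_object("logs/2024-01-01.log", b"GET /index.html 200")
print(s3.get_object("logs/2024-01-01.log"))  # b'GET /index.html 200'
```

Because the facade only joins a metadata lookup with a blob read, either store can be swapped for a bigger one without touching the other.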
Ready for the best part?
2. Performance
They store uploaded data on mechanical hard disks to reduce costs.
And organize data on disk using ShardStore - it gives better performance. Think of ShardStore as a variant of the log-structured merge (LSM) tree data structure.
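A toy sketch of the LSM idea: writes accumulate in a memory buffer and get flushed as immutable sorted runs, turning random writes into sequential ones. (The real ShardStore is a large Rust codebase; this Python toy shows only the core pattern, and all names are invented.)

```python
# Tiny LSM-style store: writes go to an in-memory buffer; when full,
# the buffer is flushed as an immutable sorted run (a sequential write,
# which spinning disks handle well). Reads check the buffer first, then
# the runs newest-first.

class TinyLSM:
    def __init__(self, flush_at=2):
        self.memtable = {}   # in-memory write buffer
        self.runs = []       # immutable, sorted runs on "disk"
        self.flush_at = flush_at

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_at:
            # flush the buffer as one sorted, immutable run
            self.runs.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):  # newest run wins
            if key in run:
                return run[key]
        return None

db = TinyLSM()
db.put("a", 1)
db.put("b", 2)   # buffer is full, triggers a flush
db.put("a", 9)   # newer value shadows the flushed one
print(db.get("a"))  # 9
print(db.get("b"))  # 2
```

Real LSM stores also compact runs in the background and add per-run indexes; the sketch omits both for brevity.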
A larger hard disk can store more data.
But seek & rotation times stay roughly constant because they depend on the disk's moving parts. So its throughput is about the same as a small disk's.
Put simply, a larger disk performs worse, relative to its capacity, at retrieving data.
Throughput means the amount of data transferred over time - measured in MB/s.
Imagine the seek time as time needed to move the head to a specific track on the disk.
Think of rotation time as the time needed for the platter to spin the requested data under the head.
Also a single disk might become a hot spot if the data isn’t distributed uniformly across disks.
So they replicate data across many disks and do parallel reads - it gives higher throughput.
Besides, the load on any single disk stays lower because data can be read from any replica. Thus preventing hot spots.
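A quick back-of-the-envelope calculation shows why spreading an object across disks and reading in parallel multiplies throughput. The numbers are illustrative, not S3's real figures:

```python
# Why parallel reads beat one disk: each disk contributes its own
# throughput, so N disks read an object roughly N times faster
# (ignoring coordination overhead).

disk_throughput_mbps = 150    # assumed sequential MB/s of one spinning disk
object_size_mb = 1200         # assumed object size

# One disk: read everything serially.
one_disk_seconds = object_size_mb / disk_throughput_mbps

# Striped across 8 disks, each read in parallel.
disks = 8
parallel_seconds = (object_size_mb / disks) / disk_throughput_mbps

print(one_disk_seconds)   # 8.0 seconds
print(parallel_seconds)   # 1.0 seconds
```

The aggregate throughput scales with the number of disks, which is exactly why a fleet of cheap disks can beat one fast server.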
Yet full data replication is expensive from a storage perspective.
So they use erasure coding instead.
Think of erasure coding as a technique that offers fault tolerance with far less storage overhead than full replication.
Here’s how it works:
A data object is split into pieces called identity shards.
Mathematical algorithms are used to create extra chunks called parity shards.
The number of parity shards is lower than identity shards.
The parity shards contain enough information to recreate any lost identity shard. Thus erasure coding offers fault tolerance comparable to full replication.
Also, any sufficiently large subset of identity and parity shards can recreate the data object. So there's no need to store full copies of the data, thus reducing storage needs.
Besides, they store shards across different hard disks to avoid hot spots. And data objects from a single customer are spread across many disks for performance.
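The shard-and-parity idea can be demonstrated with the simplest possible erasure code: a single XOR parity shard. S3 uses stronger codes that tolerate multiple lost shards; this sketch tolerates exactly one:

```python
# Minimal erasure-coding sketch: 3 identity shards + 1 XOR parity
# shard. Losing any single shard is survivable, because XOR-ing the
# parity with the surviving identity shards rebuilds the missing one.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k=3):
    shard_len = -(-len(data) // k)            # ceiling division
    data = data.ljust(shard_len * k, b"\0")   # pad to equal-size shards
    shards = [data[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = shards[0]
    for s in shards[1:]:
        parity = xor_bytes(parity, s)
    return shards, parity

def recover(shards, parity, lost_index):
    # parity XOR surviving identity shards == the lost identity shard
    rebuilt = parity
    for i, s in enumerate(shards):
        if i != lost_index:
            rebuilt = xor_bytes(rebuilt, s)
    return rebuilt

shards, parity = encode(b"customer log data")
lost = shards[1]                              # pretend disk 1 died
rebuilt = recover(shards, parity, lost_index=1)
print(rebuilt == lost)  # True
```

Note the storage math: 4 shards stored for 3 shards of data is 1.33x overhead, versus 2x or 3x for full replication - the same trade-off S3 exploits at far larger scale.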
S3 stores 280 trillion data objects across a million hard disks - zettabytes of data.
And it offers 99.99% availability.
This case study shows you don’t need expensive hardware to scale with performance.
References
FAST '23 - Building and Operating a Pretty Big Storage System (My Adventures in Amazon S3)
Building and operating a pretty big storage system called S3
Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3
AWS is building a new storage backend for S3 on 40k lines of Rust
Block diagrams created with Eraser