14 Comments
Mar 6 · edited Mar 6 · Liked by Neo Kim

Great walkthrough. I really like the visuals and the way you structure your sentences to have breaks between them.

Just reading between the lines, the way they do data redundancy seems costly. I know they say it's a cheap solution, but it would be good to know if you have any insights into how much it costs to get such high durability this way?

author

Thanks, Jade. I don't have insights into their costs.

Perhaps they have it documented somewhere.

Mar 5 · Liked by Neo Kim

"Besides they use the HTTP trailer to send checksum.

Because it allows sending extra data at the end of chunked data.

Thus avoiding the need to scan data twice and check data integrity at scale."

What do we mean by 'avoiding the need to scan data twice'? Why do we need to scan the data twice to calculate the checksum? I don't quite get it. Can someone explain, please?

author

Here's how I understood it:

1) The data goes through many network hops from the client to reach the S3 server

2) Without the HTTP trailer, each server in the middle has to scan the data twice

3) The servers receive the data and compute the checksum

4) And re-compute the checksum before sending it to the next server in the path

5) The HTTP trailer attaches the checksum to the chunks and avoids that

I might have misunderstood it, though.

So someone else could probably explain it better.
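
In any case, here are roughly the mechanics of a trailing checksum as I understand them. This is a minimal Python sketch; the chunk framing and the trailer name are illustrative, not S3's actual aws-chunked wire format. The point is that the sender updates the checksum as each chunk goes out and appends it after the final chunk, so nothing has to buffer or re-read the whole payload.

import io
import zlib

def chunked_body_with_trailer(stream, chunk_size=1024 * 1024):
    # Yield HTTP chunked-encoding frames and append the checksum as a trailer.
    # The CRC is updated incrementally as each chunk is read, so the sender
    # never re-reads the payload just to compute the checksum.
    crc = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        crc = zlib.crc32(chunk, crc)  # update the checksum on the fly
        yield f"{len(chunk):x}\r\n".encode() + chunk + b"\r\n"
    yield b"0\r\n"  # zero-length chunk marks the end of the body
    yield f"x-checksum-crc32: {crc:08x}\r\n\r\n".encode()  # trailer field (illustrative name)

# Example: stream a payload without knowing its checksum up front.
for frame in chunked_body_with_trailer(io.BytesIO(b"hello world")):
    print(frame)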


At #2, each server in the middle would scan the data and compute the checksum. It can compare what it computed with what it received, and the pattern continues. I don't get why it has to scan twice. What benefit does the HTTP trailer provide here?

Mar 6 · Liked by Neo Kim

Basically, including the checksum in the HTTP trailer header offers one benefit: a client-side integrity check. Let's say you're uploading a file in chunks. A 16 MB file with a 1 MB chunk size means 16 chunks are uploaded. While uploading the last chunk, you include the checksum calculated on the client side as an HTTP trailer header.

On the S3 server side, it checks for the trailer header, knows that the last chunk has been received, calculates the checksum for the uploaded 16 MB file, and compares it with the checksum that was sent in the HTTP trailer. If they mismatch, it responds with error code 419 (checksum failed) and also includes the trailer header as part of the HTTP response. In case of a successful match, it still includes the trailer with the server-generated checksum in the response, so the client can validate that the same checksum came back and do its own integrity check.

"Each server in the middle has to scan data twice without HTTP trailer" - I think that statement is a bit confusing and misleading. Because the checksum is generated on the client side (either by the SDK or custom code) and also at the S3 end, there is no need for any server in the middle to scan each object, irrespective of whether the trailer header is present or not.

In another scenario, where the client does not add a checksum as an HTTP trailer header, the server still calculates the checksum and sends it back in the 200 response. The client can then look for the trailer header in the response and validate the checksum; if it differs, the client knows something got corrupted, issues a delete request, and retries the upload. But with the trailer header sent from the client side, you can avoid that additional delete request.
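
To make the client-side check concrete, here is a minimal sketch assuming a recent boto3 version with checksum support; the bucket and key names are placeholders.

import base64
import zlib

import boto3

s3 = boto3.client("s3")
body = b"example payload"  # placeholder data

# Compute the CRC32 locally, base64-encoded the way S3 reports checksums.
local_crc = base64.b64encode(zlib.crc32(body).to_bytes(4, "big")).decode()

# ChecksumAlgorithm asks the SDK to attach a CRC32 checksum to the upload.
resp = s3.put_object(
    Bucket="my-example-bucket",  # placeholder bucket
    Key="example-object",        # placeholder key
    Body=body,
    ChecksumAlgorithm="CRC32",
)

# S3 echoes back the checksum it computed; comparing it locally catches
# corruption without a separate delete-and-retry round trip.
if resp.get("ChecksumCRC32") != local_crc:
    raise RuntimeError("Checksum mismatch: delete and retry the upload")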


In line with my thinking, and I agree! Thanks, Sathish.

Mar 15 · Liked by Neo Kim

Is the diagram "Re-Replicating Shards of a Failed Hard Disk at Scale" intended to show how shards from a failed disk get replicated to other disks?

I don't understand how that's possible. I assume a failed disk is completely unreadable by the time it fails. I assume this diagram is not related to the data sector discussion below.

So if the disk is already unreadable, how is it going to replicate the red and green dotted shards to 2 other disks?

And if the disk is readable after failure, why care about breaking an object into shards anyway?

author

I see. The idea is to keep the replication factor stable. That means there should always be n replicas of a specific data shard.

The red and green shards are already replicated on other hard disks. So when one of those hard disks fails, they are re-replicated from the remaining hard disks to a new one, thus keeping the replication factor constant.
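
To make that concrete, here is a rough sketch of the repair step (my own illustration, not S3's code). Shard placements are tracked in a map, and when a disk fails, every shard it held is copied from a surviving replica onto a spare disk until the replication factor is back to n.

REPLICATION_FACTOR = 3  # illustrative target, not S3's actual setting

# Which disks currently hold a replica of each shard.
placement = {
    "red": {"disk-1", "disk-2", "disk-3"},
    "green": {"disk-1", "disk-4", "disk-5"},
}

def handle_disk_failure(failed_disk, spare_disks):
    # Restore the replication factor without ever reading the failed disk:
    # each affected shard is copied from one of its surviving replicas.
    for shard, disks in placement.items():
        if failed_disk not in disks:
            continue
        disks.discard(failed_disk)
        source = next(iter(disks))  # any surviving replica
        while len(disks) < REPLICATION_FACTOR and spare_disks:
            target = spare_disks.pop()
            print(f"copy shard {shard!r} from {source} to {target}")
            disks.add(target)

handle_disk_failure("disk-1", spare_disks=["disk-7", "disk-8"])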

Does that answer your question?

Mar 15 · edited Mar 15 · Liked by Neo Kim

Thanks, that's clear.

I must have read the caption as "Re-Replicating Shards FROM a Failed Hard Disk at Scale" and thought the dotted shards came from the failed disk.

Looking at it again, the diagram never specified where the dotted shards came from (which, as you answered, is the remaining working disks).


Amazing walkthrough, NK! Was super cool to learn more about the internals of S3.

Also, thanks for the High Growth Engineer shout-out! I appreciate it.

author

Thank you very much, Jordan.


How do you know this is what they do?


Nicely summarized. I was wondering how they are calculating 4 billion checksums per second. That looks like a large-scale problem to solve. Do you have any insights on that?
