5 Comments

"Besides they use the HTTP trailer to send checksum.

Because it allows sending extra data at the end of chunked data.

Thus avoiding the need to scan data twice and check data integrity at scale."

What do we mean by 'avoiding the need to scan data twice'? Why do we need to scan the data twice to calculate a checksum? I don't quite get it. Can someone explain, please?
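One possible reading of the quoted passage, offered as an assumption rather than the article's confirmed meaning: a checksum sent in a request header has to be known before the first body byte goes out, so the sender reads the data once to compute it and a second time to send it. A trailer arrives after the body, so the checksum can be computed while streaming. The sketch below counts bytes read in each case; the function names and the tiny `CHUNK` size are hypothetical and nothing here talks to a real server.

```python
import io
import zlib

CHUNK = 4  # tiny chunk size, chosen only to keep the illustration small


def read_chunks(f, counter):
    """Yield fixed-size chunks from f and count every byte read."""
    while True:
        chunk = f.read(CHUNK)
        if not chunk:
            return
        counter["bytes_read"] += len(chunk)
        yield chunk


def send_with_header_checksum(f):
    """Checksum goes in a header, so it must be known before the body is sent."""
    counter = {"bytes_read": 0}
    crc = 0
    for chunk in read_chunks(f, counter):  # pass 1: compute the checksum
        crc = zlib.crc32(chunk, crc)
    f.seek(0)
    body = b"".join(read_chunks(f, counter))  # pass 2: actually send the data
    return crc, body, counter["bytes_read"]


def send_with_trailer_checksum(f):
    """Checksum goes in a trailer, so it can be computed while sending."""
    counter = {"bytes_read": 0}
    crc = 0
    sent = []
    for chunk in read_chunks(f, counter):  # single pass: send and checksum
        crc = zlib.crc32(chunk, crc)
        sent.append(chunk)
    return crc, b"".join(sent), counter["bytes_read"]
```

Both functions produce the same checksum and the same body; the header variant simply reads every byte twice.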

Here's how I understood it:

1) The data goes through many network hops from the client to reach the S3 server

2) Each server in the middle has to scan data twice without HTTP trailer

3) The servers receive the data and compute checksum

4) And re-compute the checksum before sending it to the next server in the path

5) The HTTP trailer attaches the checksum to the chunks and avoids that second scan

Probably I misunderstood it.

So someone else could explain it better.

At #2, each server in the middle would scan the data and compute the checksum. It can compare what it received with what it computed, and the pattern continues. I don't get why it has to scan twice. What benefit does the HTTP trailer provide here?

Basically, including the checksum in the HTTP trailer offers one benefit: a client-side integrity check. Say you are uploading a file in chunks: a 16 MB file with 1 MB chunks means 16 chunks are uploaded. While uploading the last chunk, the client includes the checksum it calculated as an HTTP trailer.

On the S3 server side, the server checks for the trailer, knows that the last chunk has been received, calculates the checksum for the uploaded 16 MB file, and compares it with the checksum sent in the HTTP trailer. If they do not match, it responds with error code 419 (checksum failed) and also includes the trailer as part of the HTTP response headers. On a successful match, it still includes the HTTP trailer with the server-generated checksum in the response, so that the client can confirm the same checksum came back and complete its client-side integrity check.

"Each server in the middle has to scan data twice without HTTP trailer": I think this statement is a bit confusing and misleading. Because the checksum is generated on the client side (either by the SDK or by custom code) and again at the S3 end, there is no need for the middleman to scan each object, regardless of whether the trailer is present.
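The upload flow described here can be sketched roughly as follows, assuming HTTP/1.1 chunked transfer coding and CRC32 as the checksum. The trailer field name `x-checksum-crc32` and both function names are hypothetical, and this is not S3's actual wire format or SDK. The client updates the checksum as each chunk is framed and appends it after the last chunk; the server recomputes it over what it received and compares.

```python
import zlib
from typing import Iterable, Iterator

TRAILER_NAME = "x-checksum-crc32"  # hypothetical trailer field name


def encode_chunked_with_trailer(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Client side: frame each chunk and update a running CRC32 in the same pass."""
    crc = 0
    for chunk in chunks:
        if not chunk:
            continue  # a zero-length chunk would terminate the body early
        crc = zlib.crc32(chunk, crc)
        yield b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    # Last chunk (size 0), then the trailer section, then the closing blank line.
    yield b"0\r\n" + f"{TRAILER_NAME}: {crc:08x}\r\n".encode("ascii") + b"\r\n"


def verify_chunked_body(wire: bytes) -> bytes:
    """Server side: parse the chunks, recompute the CRC32, compare with the trailer."""
    pos = 0
    crc = 0
    payload = bytearray()
    while True:
        eol = wire.index(b"\r\n", pos)
        size = int(wire[pos:eol], 16)  # chunk size is hexadecimal
        pos = eol + 2
        if size == 0:
            break
        chunk = wire[pos:pos + size]
        crc = zlib.crc32(chunk, crc)
        payload += chunk
        pos += size + 2  # skip the chunk data and its trailing CRLF
    trailers = {}
    for line in wire[pos:].split(b"\r\n"):
        if line:
            name, _, value = line.partition(b":")
            trailers[name.strip().lower().decode()] = value.strip().decode()
    if trailers.get(TRAILER_NAME) != f"{crc:08x}":
        raise ValueError("checksum mismatch")
    return bytes(payload)
```

Joining the output of `encode_chunked_with_trailer` and passing it to `verify_chunked_body` returns the original payload; flipping any byte of the chunk data makes the server side raise instead.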

In another scenario, the client does not add a checksum as an HTTP trailer, but the server calculates the checksum and sends it back in the 200 response. The client can then look for the trailer in the response and validate the checksum. If it differs, the client knows something was corrupted, issues a delete request, and retries the upload. With a trailer sent from the client side, you avoid the additional delete request.
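A minimal sketch of that second scenario, assuming CRC32 and a made-up in-memory store: `FakeObjectStore` and `upload_with_response_check` are hypothetical names for illustration, not a real S3 client. The client compares the checksum the server reports with its own, and on a mismatch deletes the object and uploads again.

```python
import zlib


class FakeObjectStore:
    """Hypothetical in-memory stand-in for an object store; not a real S3 client."""

    def __init__(self, corrupt_first_n: int = 0):
        self.objects = {}
        self.deletes = 0
        self._corrupt_left = corrupt_first_n

    def put(self, key: str, data: bytes) -> int:
        if self._corrupt_left > 0:
            self._corrupt_left -= 1
            data = b"\x00" + data[1:]  # simulate corruption in transit
        self.objects[key] = data
        return zlib.crc32(data)  # checksum the server computed over what it stored

    def delete(self, key: str) -> None:
        self.objects.pop(key, None)
        self.deletes += 1


def upload_with_response_check(store, key: str, data: bytes, max_attempts: int = 3) -> int:
    """Upload, compare the server's checksum with ours, delete and retry on mismatch."""
    expected = zlib.crc32(data)
    for attempt in range(1, max_attempts + 1):
        if store.put(key, data) == expected:
            return attempt  # number of uploads it took
        store.delete(key)  # the extra request a client-sent trailer would avoid
    raise RuntimeError("upload failed integrity check")
```

The `delete` call in the retry loop is the extra round trip mentioned above: if the client had sent its checksum with the upload, the server could have rejected the corrupted object itself.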

That's in line with my thinking, and I agree! Thanks, Sathish.
