The System Design Newsletter

Amazon Prime Video Microservices Top Failure

#4: Read Now - Awful Microservices Architecture (7 minutes)

Neo Kim
Sep 14, 2023
They put serverless and microservices in their architecture to scale. But they faced high costs and scalability issues. A short story on microservices over-engineering.

Prime Video is a streaming service from Amazon. It offers a collection of movies and TV shows.

This post outlines how Prime Video re-architected its stream quality monitoring tool as a monolith. If you want to learn more, scroll to the bottom and find the references.

They used the stream quality monitoring tool to inspect the quality of every Prime Video stream.

Their reasons for monitoring stream quality were as follows:

  • Identify quality issues, such as block corruption or audio-video synchronization problems

  • Trigger an automatic fix if there is a problem with the stream

Figure: Prime Video failure; Source: Amazon Help on Twitter

They kept the initial release of the stream quality monitoring tool simple because they wanted to check only the streams with the highest number of viewers.

But the requirements changed as the video content and subscribers grew. They wanted to check thousands of concurrent streams.

The high-level workflow of the stream quality monitoring tool is as follows:

Figure: Stream quality monitoring tool workflow (flow chart)

Amazon Prime Video Microservices

They built the stream quality monitoring tool with microservices and serverless components. This was their initial release.

Microservices and serverless components are poster children of scalability. But they turned out to be the wrong architectural choice for this specific use case.

Workflow

Figure: Stream quality monitoring tool microservices; Prime Video

An outline of the stream quality monitoring tool workflow is as follows:

  1. Conversion service translates streams to frames or decrypted audio buffers

  2. A temporary Amazon Simple Storage Service (S3) bucket stores the frames

  3. The defect detector queries the temporary S3 bucket to download the frames

  4. The defect detector executes algorithms to inspect frames in real time. This will identify defects: video freeze, block corruption, or audio-video synchronization problems

  5. The defect detector sends out real-time notifications if there is a defect

  6. The notifications trigger a corrective action on the stream

  7. The S3 bucket stores the defect detection results. They used this data to generate analytics
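The steps above can be sketched in a few lines of Python. This is only an illustration of the data flow, not Prime Video's code: `FakeBucket` is a hypothetical in-memory stand-in for the temporary S3 bucket, and `is_defective` is a placeholder check that treats an empty frame as a defect. The point to notice is that every frame costs one write request and one read request against the bucket.

```python
from dataclasses import dataclass, field


@dataclass
class FakeBucket:
    """Hypothetical in-memory stand-in for the temporary S3 bucket."""
    objects: dict = field(default_factory=dict)
    put_requests: int = 0
    get_requests: int = 0

    def put(self, key: str, data: bytes) -> None:
        self.put_requests += 1
        self.objects[key] = data

    def get(self, key: str) -> bytes:
        self.get_requests += 1
        return self.objects[key]


def is_defective(frame: bytes) -> bool:
    """Placeholder check (an assumption): an empty frame counts as a defect."""
    return len(frame) == 0


def convert(stream: list, bucket: FakeBucket) -> list:
    """Steps 1-2: split the stream into frames and store each one in the bucket."""
    keys = []
    for index, frame in enumerate(stream):
        key = f"frame-{index}"
        bucket.put(key, frame)
        keys.append(key)
    return keys


def detect(keys: list, bucket: FakeBucket) -> list:
    """Steps 3-5: download each frame and collect the keys of defective ones."""
    return [key for key in keys if is_defective(bucket.get(key))]
```

Calling `convert` and then `detect` on a three-frame stream leaves the bucket with three write requests and three read requests, so the request count grows linearly with the number of frames.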

Amazon S3 is a cloud object storage service. They used AWS Step Functions to build the conversion service and the defect detector. AWS Step Functions is a workflow orchestration service for coordinating AWS services.

They scaled the defect detector by adding new instances of Step Functions workflows and AWS Lambda functions. AWS Lambda is a compute service that runs code without provisioning dedicated servers.

But they ran into 2 problems with this architecture:

  • It became expensive on a high-scale workload

  • There were scalability issues

What Went Wrong?

Two factors determined the pricing of AWS Step Functions: the number of requests (state transitions) in the workflow and its duration. The tool performed a state transition for every second of the stream, so the costs climbed quickly.

Also, each AWS account has a limit on the number of state transitions. This limit became a scalability bottleneck for orchestration with AWS Step Functions.
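A rough back-of-the-envelope sketch shows why per-second state transitions hurt twice: once on the bill and once on the account limit. Every input below is a caller-supplied assumption; the function names are hypothetical, and the price and limit values are placeholders rather than figures from AWS or Prime Video.

```python
def state_transition_cost(
    streams: int,
    seconds: int,
    transitions_per_stream_second: int,
    price_per_1000_transitions: float,
) -> float:
    """Estimated orchestration cost for monitoring `streams` for `seconds`."""
    total_transitions = streams * seconds * transitions_per_stream_second
    return total_transitions / 1000 * price_per_1000_transitions


def max_concurrent_streams(
    account_transitions_per_second: int,
    transitions_per_stream_second: int,
) -> int:
    """Upper bound on streams before the account-level limit is reached."""
    return account_transitions_per_second // transitions_per_stream_second
```

For example, 1,000 streams monitored for one hour at one transition per stream-second produce 3.6 million transitions. And whatever the account limit is, dividing it by the transitions per stream-second gives a hard ceiling on concurrent streams.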

Three factors determine the pricing of AWS S3: the amount of data stored, the amount of outbound data transfer, and the number of requests.

The defect detector downloaded the frames from the temporary S3 bucket. The high number of requests to the temporary S3 bucket caused high costs.
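The three pricing factors can be put into a small estimator. This is a sketch under stated assumptions: `S3Prices` and `s3_cost` are hypothetical names, and the unit prices are inputs you would look up on the AWS pricing page, not values taken from this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class S3Prices:
    """Unit prices supplied by the caller (assumed inputs, not real AWS prices)."""
    per_gb_stored: float
    per_gb_transferred_out: float
    per_1000_requests: float


def s3_cost(prices: S3Prices, gb_stored: float, gb_out: float, requests: int) -> dict:
    """Break an S3 bill estimate into its three components plus a total."""
    storage = gb_stored * prices.per_gb_stored
    transfer = gb_out * prices.per_gb_transferred_out
    request_cost = requests / 1000 * prices.per_1000_requests
    return {
        "storage": storage,
        "transfer": transfer,
        "requests": request_cost,
        "total": storage + transfer + request_cost,
    }
```

With a temporary bucket, the stored amount stays small because frames are short-lived, but the request term keeps growing with every frame written and read.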

Amazon Prime Video Monolith

Software architecture is the set of design decisions that, if made incorrectly, may cause your project to be canceled.

- Eoin Woods, Endava (CTO)

They realized that a microservices architecture would not fix their scalability problems. So, they re-architected the stream quality monitoring tool and re-implemented it as a monolith.

Yet the high-level architecture of Prime Video remains unaffected. Put another way, Prime Video is still based on a microservices architecture.

Drawing the microservice boundaries incorrectly can significantly diminish the benefits of using microservices, or in some cases even derail the entire effort.

- Microservices: Up and Running, Ronnie Mitra, Irakli Nadareishvili

They decided to change the microservice boundaries and set new ones based on business capabilities. So, they consolidated only the microservices for the stream quality monitoring tool and created a single monolith.

Workflow

Figure: Stream quality monitoring tool monolith; Prime Video

An outline of the new stream quality monitoring tool workflow is as follows:

  1. Conversion service translates streams to frames or decrypted audio buffers

  2. The defect detector receives the frames from the conversion service in memory

  3. The defect detector executes algorithms to inspect frames in real time. This will identify defects: video freeze, block corruption, or audio-video synchronization problems

  4. The defect detector sends out real-time notifications if there is a defect

  5. The notifications trigger a corrective action on the stream

  6. The S3 bucket stores the defect detection results. They used this data to generate analytics

They combined all the components into a single process to create a monolith. This enabled buffer transfer through memory instead of a temporary S3 bucket. Thus, they eliminated the need for a temporary S3 bucket.
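A minimal sketch of in-memory buffer transfer inside one process follows. It is an illustration, not Prime Video's implementation: the threads, the bounded `queue.Queue`, the `_DONE` sentinel, and the empty-frame defect check are all assumptions chosen to keep the example short.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def convert(stream, frames: queue.Queue) -> None:
    """Conversion step: push each frame straight into the shared queue."""
    for frame in stream:
        frames.put(frame)
    frames.put(_DONE)


def detect(frames: queue.Queue, defects: list) -> None:
    """Detection step: read frames from memory and record defective indexes."""
    index = 0
    while True:
        frame = frames.get()
        if frame is _DONE:
            break
        if len(frame) == 0:  # placeholder defect check (an assumption)
            defects.append(index)
        index += 1


def run_pipeline(stream) -> list:
    """Run both steps in one process; no object storage sits between them."""
    frames = queue.Queue(maxsize=64)  # bounded so the converter cannot run far ahead
    defects = []
    producer = threading.Thread(target=convert, args=(stream, frames))
    consumer = threading.Thread(target=detect, args=(frames, defects))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return defects
```

Here the handoff between the two steps is a queue operation in memory, so it generates no storage requests and no network round trips.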

They built a lightweight orchestration layer with ECS. It replaced the expensive AWS Step Functions.

ECS is the Amazon Elastic Container Service. It simplifies the deployment, management, and scaling of containerized applications.

They scaled the defect detector diagonally. They ran many instances of the defect detector in a single EC2 instance (vertical scaling) and then replicated the EC2 instance (horizontal scaling).
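The sizing arithmetic behind this kind of scaling can be sketched as below. The function name and its parameters are hypothetical, and the capacity numbers you pass in would come from your own load tests, not from this post.

```python
import math


def instances_needed(
    concurrent_streams: int,
    detectors_per_instance: int,
    streams_per_detector: int = 1,
) -> int:
    """EC2 instances required when each one hosts several defect detectors."""
    if detectors_per_instance <= 0 or streams_per_detector <= 0:
        raise ValueError("capacity values must be positive")
    capacity_per_instance = detectors_per_instance * streams_per_detector
    return math.ceil(concurrent_streams / capacity_per_instance)
```

Raising `detectors_per_instance` is the vertical part of the scaling, and the returned instance count is the horizontal part.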

They preallocated the resources on AWS. This allowed them to get the services at a discounted price through Compute Savings Plans, which reduced the costs further.

Microservice Monolith Architecture

The key takeaways from this case study are as follows:

  • Keep the architecture simple

  • Define microservice boundaries using domain-driven design principles

  • Back-of-the-envelope analysis is important

  • Don't follow the buzzwords

  • Microservices come with an overhead cost

They reduced the infrastructure costs by over 90% with the monolith rewrite. Also, they resolved the scalability issues.

As a result, they were able to offer a better customer experience with improved stream quality.


References

  • Marcin Kolny. (2023, March 22). Scaling up the Prime Video audio/video monitoring service and reducing costs by 90%. Prime Video Tech Blog.

  • Sathya Balakrishnan, Ihsan Ozcelik. (2023, February 01). How Prime Video Uses Machine Learning to Ensure Video Quality. Prime Video Tech Blog.

  • Richard Jones. (2023, February 22). How Prime Video Troubleshoots Quickly and Cost-Effectively at Scale. Prime Video Tech Blog

  • Sathya Balakrishnan, Ihsan Ozcelik. (2022, March 04). How Prime Video Uses Machine Learning to Ensure Video Quality. Amazon Science Blog.

  • Cloudflare. (n.d.). Why Use Serverless? Cloudflare Learning.

  • Ryan Frankel. (2023, March 24). AWS S3 Pricing: How to Save Big on Cloud Storage Costs. HostingAdvice.

  • Amazon Web Services, Inc. (n.d.). AWS Step Functions Pricing. [Amazon Web Services].

  • Irakli Nadareishvili, Ronnie Mitra. (2020, December 8). Microservices: Up and Running (Chapter 4). O'Reilly Media.

  • What is the difference between Amazon ECS and Amazon EC2? (2016, October 31). In Stack Overflow.

  • Allocate Memory for Amazon ECS Tasks. (n.d.). In AWS Knowledge Center.

  • Photo by Thibault Penin on Unsplash


Discussion about this post

Craig M
Sep 17, 2023

I'm still not sure why everyone gets so concerned with "monolith" and "microservice" when actually all that matters is the service boundary. The largest monoliths of our day still have contracts between modules that have to be conformed to.

There is a race to the bottom when people think making your services ever more granular is the way to achieve scale and cost effectiveness. Where does this end? Do we eventually process each and every command or statement on a new "micro command as a service"?
