Amazon Prime Video Microservices Top Failure
#4: Read Now - Awful Microservices Architecture (7 minutes)
They put serverless and microservices in their architecture to scale. But they faced high costs and scalability issues. A short story on microservices over-engineering.
Prime Video is a streaming service from Amazon. It offers a collection of movies and TV shows.
This post outlines how Prime Video re-architected its stream quality monitoring tool as a monolith. If you want to learn more, scroll to the bottom and find the references.
They used the stream quality monitoring tool to inspect the quality of every Prime Video stream.
Their reasons for monitoring stream quality were as follows:
Identify quality issues, such as block corruption or audio-video synchronization problems
Trigger an automatic fix if there is a problem with the stream
They kept the initial release of the stream quality monitoring tool simple because they wanted to check only the streams with the highest number of viewers.
But the requirements changed as the video content and subscribers grew. They wanted to check thousands of concurrent streams.
The high-level workflow of the stream quality monitoring tool is as follows:
Amazon Prime Video Microservices
They built the stream quality monitoring tool with microservices and serverless components. This was their initial release.
Microservices and serverless components are poster children of scalability. But they turned out to be the wrong architectural choice for this specific use case.
Workflow
An outline of the stream quality monitoring tool workflow is as follows:
Conversion service translates streams to frames or decrypted audio buffers
A temporary Amazon Simple Storage Service (S3) bucket stores the frames
The defect detector queries the temporary S3 bucket to download the frames
The defect detector executes algorithms to inspect frames in real time. This will identify defects: video freeze, block corruption, or audio-video synchronization problems
The defect detector sends out real-time notifications if there is a defect
The notifications trigger a corrective action on the stream
The S3 bucket stores the defect detection results. They used this data to generate analytics
Amazon S3 is a cloud object storage service. They used AWS Step Functions to build the conversion service and the defect detector. AWS Step Functions is a workflow orchestration service for coordinating AWS services.
They scaled the defect detector by adding new instances of Step Functions workflows and AWS Lambda functions. AWS Lambda is a compute service that runs code without provisioning dedicated servers.
But they ran into 2 problems with this architecture:
It became expensive on a high-scale workload
There were scalability issues
What Went Wrong?
AWS Step Functions charges for each state transition in a workflow (this is the pricing model for its Standard Workflows). The tool made a state transition for every second of the stream, so the costs grew quickly at scale.
Also, AWS limits the number of state transitions per account. The tool reached this limit, which made orchestration with AWS Step Functions a scalability bottleneck.
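A back-of-the-envelope sketch shows why per-second state transitions add up. The price per transition and the account limit below are placeholder numbers chosen for illustration, not AWS's published figures, and the function names are hypothetical.

```python
def transitions_per_month(streams: int,
                          transitions_per_stream_second: float,
                          seconds_per_month: int = 30 * 24 * 3600) -> float:
    """Total state transitions if every stream is monitored around the clock."""
    return streams * transitions_per_stream_second * seconds_per_month


def monthly_orchestration_cost(transitions: float,
                               price_per_transition: float) -> float:
    """Cost under a pay-per-transition pricing model."""
    return transitions * price_per_transition


def max_streams_under_limit(limit_transitions_per_second: float,
                            transitions_per_stream_second: float) -> int:
    """Concurrent streams that fit under an account-level transition rate limit."""
    return int(limit_transitions_per_second // transitions_per_stream_second)


if __name__ == "__main__":
    # Assumed inputs, for illustration only.
    streams = 1_000
    per_stream = 1.0      # one transition per second of stream, as in the post
    price = 0.000025      # hypothetical price per transition, in dollars
    limit = 5_000         # hypothetical account limit, in transitions per second

    total = transitions_per_month(streams, per_stream)
    print(f"{total:,.0f} transitions per month")
    print(f"${monthly_orchestration_cost(total, price):,.2f} per month")
    print(f"{max_streams_under_limit(limit, per_stream)} streams fit under the limit")
```

Both the cost and the ceiling on concurrent streams scale linearly with the number of transitions per stream-second, which is why the orchestration layer became the first bottleneck.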
Three factors determine the pricing of Amazon S3: the amount of data stored, the amount of outbound data transfer, and the number of requests.
The defect detector downloaded the frames from the temporary S3 bucket. The high number of requests to the temporary S3 bucket caused high costs.
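The request volume can be estimated in the same way. This sketch assumes one S3 object per frame, one upload by the conversion service, and one download per detector; the frame rate, detector count, and prices are illustrative assumptions, not figures from Prime Video or AWS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestPrices:
    """Hypothetical per-request prices in dollars; substitute real figures."""
    put: float
    get: float


def requests_per_second(streams: int, frames_per_second: int,
                        detectors: int) -> tuple[int, int]:
    """Return (uploads, downloads) per second against the temporary bucket."""
    puts = streams * frames_per_second   # conversion service writes each frame once
    gets = puts * detectors              # each detector reads each frame once
    return puts, gets


def hourly_request_cost(streams: int, frames_per_second: int,
                        detectors: int, prices: RequestPrices) -> float:
    """Request cost for one hour of monitoring, ignoring storage and transfer."""
    puts, gets = requests_per_second(streams, frames_per_second, detectors)
    return 3600 * (puts * prices.put + gets * prices.get)


if __name__ == "__main__":
    prices = RequestPrices(put=0.000005, get=0.0000004)  # assumed values
    print(requests_per_second(1_000, 30, 2))
    print(f"${hourly_request_cost(1_000, 30, 2, prices):,.2f} per hour")
```

Under these assumptions the request count grows with streams, frame rate, and detector count multiplied together, so adding a detector raises the bill for every stream.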
Amazon Prime Video Monolith
Software architecture is the set of design decisions that, if made incorrectly, may cause your project to be canceled.
- Eoin Woods, Endava (CTO)
They realized that the microservices architecture would not fix their scalability problems. So, they re-architected the stream quality monitoring tool and re-implemented it as a monolith.
Yet the high-level architecture of Prime Video remains unaffected. Put another way, Prime Video is still based on a microservices architecture.
Drawing the microservice boundaries incorrectly can significantly diminish the benefits of using microservices, or in some cases even derail the entire effort.
- Microservices: Up and Running, Ronnie Mitra, Irakli Nadareishvili
They decided to redraw the microservice boundaries based on business capabilities. So, they consolidated only the microservices that made up the stream quality monitoring tool into a single monolith.
Workflow
An outline of the new stream quality monitoring tool workflow is as follows:
Conversion service translates streams to frames or decrypted audio buffers
The defect detector receives the frames from the conversion service through memory
The defect detector executes algorithms to inspect frames in real time. This will identify defects: video freeze, block corruption, or audio-video synchronization problems
The defect detector sends out real-time notifications if there is a defect
The notifications trigger a corrective action on the stream
The S3 bucket stores the defect detection results. They used this data to generate analytics
They combined all the components into a single process to create a monolith. This enabled buffer transfer through memory instead of a temporary S3 bucket. Thus, they eliminated the need for a temporary S3 bucket.
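The in-memory handoff can be sketched as a producer and a consumer sharing a bounded queue inside one process. This is a minimal illustration, not Prime Video's implementation: the function names are hypothetical, and the "all-zero frame" rule is a toy stand-in for the real defect detection algorithms.

```python
import queue
import threading


def convert(stream_chunks, frames: queue.Queue) -> None:
    """Stand-in for the conversion service: turn stream chunks into frames."""
    for chunk in stream_chunks:
        frames.put(bytes(chunk))   # the frame never leaves process memory
    frames.put(None)               # sentinel marking the end of the stream


def detect(frames: queue.Queue, defects: list) -> None:
    """Stand-in for the defect detector: flag all-zero frames (toy rule)."""
    index = 0
    while (frame := frames.get()) is not None:
        if not any(frame):
            defects.append(index)
        index += 1


def monitor(stream_chunks) -> list:
    """Run conversion and detection in one process, linked by a queue."""
    frames = queue.Queue(maxsize=8)   # bounded buffer applies backpressure
    defects: list = []
    worker = threading.Thread(target=detect, args=(frames, defects))
    worker.start()
    convert(stream_chunks, frames)
    worker.join()
    return defects


if __name__ == "__main__":
    print(monitor([b"\x01\x02", b"\x00\x00", b"\x03"]))
```

Because the frames move through a queue rather than an object store, there are no per-frame upload or download requests to pay for.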
They built a lightweight orchestration layer with ECS. It replaced the expensive AWS Step Functions.
ECS is the Amazon Elastic Container Service. It simplifies the deployment, management, and scaling of containerized applications.
They scaled the defect detector diagonally, a mix of vertical and horizontal scaling. They ran many instances of the defect detector on a single EC2 instance and then replicated that EC2 instance.
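A rough capacity plan for this kind of scaling packs detectors onto an instance first and then counts how many instances are needed. The per-detector and per-instance capacities below are assumed values for illustration, and `plan_capacity` is a hypothetical helper.

```python
import math


def plan_capacity(streams: int, streams_per_detector: int,
                  detectors_per_instance: int) -> tuple[int, int]:
    """Return (detectors, instances) needed to monitor the given streams."""
    detectors = math.ceil(streams / streams_per_detector)
    # Vertical step: fill each instance with detectors.
    # Horizontal step: replicate the instance until all detectors are placed.
    instances = math.ceil(detectors / detectors_per_instance)
    return detectors, instances


if __name__ == "__main__":
    # Assumed capacities, for illustration only.
    detectors, instances = plan_capacity(
        streams=1_000, streams_per_detector=10, detectors_per_instance=8)
    print(f"{detectors} detectors across {instances} instances")
```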
They committed to compute capacity on AWS in advance. This let them buy the compute at a discount through Compute Savings Plans, which reduced the costs even further.
Microservice Monolith Architecture
The key takeaways from this case study are as follows:
Keep the architecture simple
Define microservice boundaries using domain-driven design principles
Back-of-the-envelope analysis is important
Don't follow the buzzwords
Microservices come with an overhead cost
They reduced the infrastructure costs by over 90% with the monolith rewrite. Also, they resolved the scalability issues.
As a result, they were able to offer a better customer experience with improved stream quality.
References
Marcin Kolny. (2023, March 22). Scaling up the Prime Video audio/video monitoring service and reducing costs by 90%. Prime Video Tech Blog.
Sathya Balakrishnan, Ihsan Ozcelik. (2023, February 01). How Prime Video Uses Machine Learning to Ensure Video Quality. Prime Video Tech Blog.
Richard Jones. (2023, February 22). How Prime Video Troubleshoots Quickly and Cost-Effectively at Scale. Prime Video Tech Blog
Sathya Balakrishnan, Ihsan Ozcelik. (2022, March 04). How Prime Video Uses Machine Learning to Ensure Video Quality. Amazon Science Blog.
Cloudflare. (n.d.). Why Use Serverless? Cloudflare Learning.
Ryan Frankel. (2023, March 24). AWS S3 Pricing: How to Save Big on Cloud Storage Costs. HostingAdvice.
Amazon Web Services, Inc. (n.d.). AWS Step Functions Pricing. Amazon Web Services.
Irakli Nadareishvili, Ronnie Mitra. (2020, December 8). Microservices: Up and Running (Chapter 4). O'Reilly Media.
What is the difference between Amazon ECS and Amazon EC2? (2016, October 31). In Stack Overflow.
Allocate Memory for Amazon ECS Tasks. (n.d.). In AWS Knowledge Center.
I'm still not sure why everyone gets so concerned with "monolith" and "microservice" when actually all that matters is the service boundary. The largest monoliths of our day still have contracts between modules that have to be conformed to.
There is a race to the bottom when people think making your services ever more granular is the way to achieve scale and cost effectiveness. Where does this end? Do we eventually process each and every command or statement on a new "micro command as a service"?