How Razorpay Scaled to Handle Flash Sales at 1500 Requests per Second
#46: A Case Study on Payment Gateway Scalability (5 min read)
2020 - India.
IPL, the most famous cricket league in the world, is about to start.
And more than 23 million raving fans of cricket in India will stream it.
Meanwhile, companies sell food at a discount via flash sales minutes before the game starts.
And they accept online payments via Razorpay, a shiny payment gateway service.
Flash sales create a traffic spike with transactions reaching 1500 requests per second.
Although it’s possible to serve this traffic, scaling the infrastructure quickly to handle it could be difficult.
This post outlines how Razorpay scales to handle flash sales. If you want to learn more, scroll to the bottom and find the references.
Note: This post is based on my research and may differ from real-world implementation.
Payment Gateway Architecture
Here are their scalability techniques for flash sales:
1. Rate Limit the Traffic
They rate limit the traffic to prevent server overload.
And use an Nginx proxy server as the rate limiter. It gets deployed as a sidecar and runs a dedicated cache for rate limiting. Imagine the sidecar pattern as extending a service by attaching an extra container.
Besides, they use the fixed window algorithm for efficient rate limiting. It uses a single atomic counter per key with an expiry time (TTL). Requests get rejected once the counter reaches the limit, and the counter resets when the TTL expires.
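Here is a minimal sketch of the fixed window idea in Python. It is not Razorpay's Nginx setup. The class name, the in-memory dictionary, and the injectable clock are assumptions for illustration. A real deployment would keep the counter in a shared cache and increment it atomically.

```python
import time


class FixedWindowRateLimiter:
    """Hypothetical in-memory limiter: one counter per key with an expiry time."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the window can be tested without sleeping
        self._counters = {}  # key -> (count, window_expiry_time)

    def allow(self, key):
        now = self.clock()
        count, expires_at = self._counters.get(key, (0, 0.0))
        if now >= expires_at:
            # The TTL elapsed (or the key is new): start a fresh window.
            count, expires_at = 0, now + self.window_seconds
        if count >= self.limit:
            # Limit reached for this window: reject without counting.
            self._counters[key] = (count, expires_at)
            return False
        self._counters[key] = (count + 1, expires_at)
        return True
```

For example, a limiter created with limit=2 and window_seconds=1.0 accepts two requests per key and rejects the rest until the window expires. This sketch is single-process and not thread-safe.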
2. Connection Pooling
They use MySQL as the main database. And clients compete with each other for database connections during flash sales.
They run PHP on the application layer. But PHP runs each request in its own process. So processes can't share resources such as open database connections. Hence PHP doesn’t support database connection pooling natively.
This means the application layer holds database connections while waiting for results. So connection starvation is likely to happen if the queries get expensive.
Also, the database performance degrades as the number of idle connections increases.
So they use ProxySQL as a database proxy. It holds a pool of connections to the database.
And a tenant connects to ProxySQL instead of MySQL directly. This limits the number of MySQL connections and avoids connection starvation.
Think of a tenant as an isolated data space for a specific user.
Also, constantly opening and closing database connections could be expensive. ProxySQL prevents it via persistent connections.
Besides, ProxySQL caches the query results for low latency.
They deploy ProxySQL as a sidecar to keep the application layer stateless. And for high availability, they set up a fallback to connect directly to MySQL if ProxySQL fails.
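The pooling idea can be sketched in a few lines of Python. This is not how ProxySQL is implemented. The ConnectionPool class and the connect callable are hypothetical names. The sketch only shows the two properties described above: a fixed upper bound on open connections, and reuse of connections instead of opening and closing them per request.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical bounded pool: opens `size` connections once and reuses them."""

    def __init__(self, connect, size):
        # `connect` is any zero-argument callable that returns a connection object.
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=1.0):
        # Blocks until a connection is free; raises queue.Empty after `timeout`
        # seconds, so callers wait in line instead of opening more connections.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Return the connection to the pool instead of closing it.
            self._idle.put(conn)
```

A caller would write `with pool.connection() as conn:` and run its query inside the block. The database then never sees more than `size` connections from this pool, no matter how many callers wait.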
3. Avoid the Thundering Herd
A thundering herd occurs when many clients query the server concurrently during flash sales. That means bad performance and downtime.
So they use these techniques to prevent the thundering herd problem:
Throttle incoming traffic
Add exponential backoff on the client
Include caching
Besides, they use ProxySQL to throttle tenants issuing expensive database queries.
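Client-side exponential backoff could look like the sketch below. The function names, the default values, and the use of random jitter are assumptions, not details from Razorpay. Jitter is a common addition because it spreads the retries out, so clients that failed together don't all retry at the same instant.

```python
import random
import time


def backoff_delay(attempt, base, cap, rng=random.random):
    """Delay before the next retry: random in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(operation, attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call `operation`, retrying on ConnectionError with growing, jittered delays."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error to the caller
            sleep(backoff_delay(attempt, base, cap))
```

The upper bound of the delay doubles with each failed attempt until it reaches the cap. So a struggling server gets more breathing room the longer it stays unhealthy.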
4. Autoscaling Isn’t Enough
It takes around 4 minutes for a newly provisioned server to become healthy. So they don't rely only on autoscaling to handle traffic spikes.
Instead they prewarm their infrastructure. And run baked container images to reduce the deployment time.
They do capacity planning based on estimated transactions and scale their servers horizontally. Also, they scale down the infrastructure after flash sales with autoscaling.
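A back-of-the-envelope version of that capacity planning might look like this. Only the 1500 requests per second figure comes from this post. The per-server throughput and the headroom percentage are made-up numbers for illustration.

```python
def servers_needed(peak_rps, rps_per_server, headroom_percent=30):
    """Servers to prewarm for a peak load, with spare capacity on top.

    Uses integer math and rounds up, since a fraction of a server can't be run.
    """
    required = peak_rps * (100 + headroom_percent)
    capacity = 100 * rps_per_server
    return -(-required // capacity)  # ceiling division


# Assumption: each server sustains 100 requests per second (hypothetical).
prewarmed = servers_needed(peak_rps=1500, rps_per_server=100, headroom_percent=30)
```

The headroom covers estimation errors, because autoscaling is too slow to correct a bad estimate in the middle of a spike.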
5. Smart Routing
They should forward the traffic only to external bank gateways that are operational.
So they use routing rules based on machine learning. The model considers the success and failure events from payments. And then predicts the success probability of each external gateway.
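The sketch below shows the routing idea with a much simpler stand-in for the machine learning model: a smoothed success rate per gateway. The GatewayRouter class and the gateway names are hypothetical, and the smoothing formula is an assumption, not Razorpay's model.

```python
from collections import defaultdict


class GatewayRouter:
    """Hypothetical router: picks the gateway with the best observed success rate."""

    def __init__(self):
        self._stats = defaultdict(lambda: [0, 0])  # gateway -> [successes, failures]

    def record(self, gateway, success):
        # Feed every payment outcome back into the statistics.
        self._stats[gateway][0 if success else 1] += 1

    def success_probability(self, gateway):
        successes, failures = self._stats[gateway]
        # Add-one smoothing: an unseen gateway starts at 0.5 rather than 0 or 1.
        return (successes + 1) / (successes + failures + 2)

    def choose(self, gateways):
        return max(gateways, key=self.success_probability)
```

If a gateway starts failing, its estimated probability drops with each recorded failure and the router shifts traffic to healthier gateways. A production system would also weigh recent events more heavily than old ones.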
6. Testing
They must resolve system bottlenecks for better performance.
So they do load testing using an open-source tool called k6. It checks the system's performance under an expected load. And provides information about latency and throughput.
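k6 scripts are written in JavaScript, so the sketch below is not k6 code. It only illustrates, in Python, the kind of summary a load test reports: throughput and latency percentiles. The function names and the nearest-rank percentile method are assumptions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, duration_seconds):
    """Summarize one test run from per-request latencies and the run duration."""
    return {
        "throughput_rps": len(latencies_ms) / duration_seconds,
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
    }
```

Percentiles matter more than the average here. A handful of slow payments hides easily in an average but shows up in the 95th percentile.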
7. Flywheel Effect
They profile the system for bottlenecks.
And feed what they learn back into the system in a constant loop. This helped improve their performance.
They run critical services like payments and orders in separate microservices. This way each service can scale on its own.
This case study shows that simple and proven techniques can solve most scalability problems.