How PayPal Was Able to Support a Billion Transactions per Day With Only 8 Virtual Machines
#30: Learn More - Awesome PayPal Engineering (4 minutes)
Get the powerful template to approach system design for FREE on newsletter sign-up:
This post outlines how PayPal scaled to a billion daily transactions with only 8 virtual machines. If you want to learn more, scroll to the bottom and find the references.
Share this post with somebody who wants to study system design & I'll send you some rewards for the referrals.
December 1998 - California, United States.
A team of engineers creates security software for hand-held devices.
Yet their business model failed.
So they pivoted to create an online payment service and called it PayPal.
As the number of users grew in the early days, they bought newer hardware to scale.
But transistors in an integrated circuit (IC) stopped doubling every other year.
Put another way, the single-threaded performance gain of Moore’s law slowed down. So buying newer hardware couldn't solve their scalability problems.
Yet their growth rate was explosive.
And hit 1 million transactions a day in the next 2 years.
So they scaled out by running services in more than 1000 virtual machines.
Although they solved the scalability issue, it created new problems.
Here are some of them:
1. Network Infrastructure
A request took more network hops to finish and it worsened latency.
Also it became expensive to maintain the network infrastructure.
2. Maintainance Costs
Adding more servers increased their infrastructure complexity.
Besides the service deployment across every machine took more time.
And setting up autoscaling needed extra effort.
Also infrastructure management like monitoring became difficult.
3. Resource Usage
They didn’t fully use the CPU of each server.
Put another way, the server throughput was low.
It resulted in resource wastage and extra costs.
Actor Model
The code doesn’t take full advantage of the hardware unless it’s run concurrently.
Also they wanted to keep it simple and scalable.
So they moved to the actor model based on the Akka framework (Java).
It allowed them to support a billion daily transactions with only 8 virtual machines.
The actor model is a conceptual concurrent computation model.
And an actor is the fundamental unit of computation.
Here’s how the actor model offers extreme scalability:
1. Resource Usage
An actor is an extremely lightweight object. It takes fewer resources than threads.
So it’s easy to create millions of actors if necessary.
The actor does an action when a message is received.
Yet the actor is decoupled from the source of the message.
A thread gets assigned to an actor when it must process a message.
While the thread is released after the message is processed and gets assigned to another actor.
The number of threads is proportional to the number of CPU cores.
Yet a small number of threads can handle a large number of concurrent actors.
Because a thread gets assigned to an actor only during its runtime.
2. State Information
Actors don’t share memory and are isolated from each other.
Put another way, the state of an actor is private.
They communicate with each other through messages.
Messages are simple and immutable data structures that get sent over the network.
Each actor has a mailbox. It’s like a message queue.
Actors store messages in the mailbox until they get processed in a First-in First-out (FIFO) order.
Also actors allow the system to avoid extra network calls to a distributed cache or a database.
Because they store the local state in the application server.
Put another way, a stateful application server improves performance. Because it caches data locally.
Besides PayPal uses consistent hashing to route a customer to the same server.
3. Concurrency
Many actors can run at the same time but each actor process messages sequentially.
Put another way, an actor can process only a single message at a time.
So they need 3 actors to process 3 messages in parallel.
Also actors work asynchronously. In other words, they don’t wait for another actor's response.
So the actor model makes concurrency easier.
Besides PayPal uses the functional programming style of Akka for scalability.
It prevents side effects and makes testing easier.
Also they use pluggable code pieces with functional programming for easy scalability.
The actors could run locally or remotely on another machine.
Yet it’s transparent to the system.
4. Fault Tolerance
An actor can create more actors and also supervise them.
The supervisor actor restarts the supervised actor if it fails. Also the message can be routed to another actor.
Besides errors propagate to the supervisor actor.
So graceful error handling can be done without code clutter.
The actor model is not a silver bullet to scalability.
It introduces a learning curve for the developers.
Also extra care should be taken to prevent race conditions and deadlocks.
The actor model allowed PayPal to handle extreme scale with fewer resources.
Consider subscribing to get simplified case studies delivered straight to your inbox:
Thank you for supporting this newsletter. Consider sharing this post with your friends and get rewards. Y’all are the best.
References
https://medium.com/paypal-tech/squbs-a-new-reactive-way-for-paypal-to-build-applications-127126bf684b
https://en.wikipedia.org/wiki/PayPal
https://akka.io/
https://finematics.com/actor-model-explained/
https://www.brianstorti.com/the-actor-model/
Looks like Erlang/Elixir’s OTP model.
Actor model is rarely used, we use it in Scala in some of our projects. I think its underrated.