How Zapier Automates Billions of Tasks
#37: Learn More - Zapier Architecture Overview (5 minutes)
Get the powerful template to approach system design for FREE on newsletter sign-up.
This post outlines Zapier's architecture. If you want to learn more, scroll to the bottom to find the references.
Consider sharing this post with someone who wants to study system design.
Once upon a time, there lived an office assistant named Sophie.
She was bright and capable.
But she got exhausted from repetitive office tasks.
Until one day, she heard about Zapier from a coworker.
She automated a frequent workflow: whenever an event gets created in her office Google Calendar, it sends Slack notifications and adds rows to Google Sheets automatically.
And she was stunned by how easy the automation was.
Zapier Architecture
Here's how Zapier automates billions of tasks:
1. Tech Stack
They run the Nginx web server and Python Django framework on the backend.
And they store data in MySQL and Redis. For example, Zaps get stored in MySQL.
Zap is an automated workflow that connects different tasks or services.
While MySQL is a relational database management system.
They store the number of in-flight tasks in Redis. It allows them to throttle task execution.
Redis is an in-memory key-value database.
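Here's a minimal sketch of how such a throttle could work, assuming a per-account in-flight counter in Redis (the key name and limit are illustrative, not Zapier's actual values):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

MAX_IN_FLIGHT = 100  # hypothetical per-account limit


def try_start_task(account_id: str) -> bool:
    """Increment the in-flight counter; refuse the task if over the limit."""
    key = f"in_flight:{account_id}"
    if r.incr(key) > MAX_IN_FLIGHT:
        r.decr(key)  # roll back so the task can be retried later
        return False
    return True


def finish_task(account_id: str) -> None:
    r.decr(f"in_flight:{account_id}")  # free a slot when the task completes
```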
While AWS Lambda runs custom scripts provided by the user.
Lambda is a serverless computing platform.
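A minimal sketch of what such a handler could look like, assuming the user's snippet arrives in the invocation event (the event shape is an assumption, not Zapier's actual contract):

```python
def lambda_handler(event, context):
    # The user's custom Python snippet, passed in the invocation event.
    user_code = event["code"]
    scope = {"input_data": event.get("input_data", {}), "output": None}
    exec(user_code, scope)  # run the snippet in its own namespace
    return {"output": scope.get("output")}
```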
2. Zap Implementation
They use a directed rooted tree to model a workflow.
While each tree node represents a task.
And the tree is stored in MySQL for simplicity.
A directed rooted tree is a directed graph with all edges pointing away from the root node.
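One common way to store such a tree in MySQL is an adjacency list, where each node points to its parent. Here's a minimal Django model sketch; the model and field names are illustrative, not Zapier's schema:

```python
from django.db import models


class ZapNode(models.Model):
    zap_id = models.IntegerField()                # which workflow this node belongs to
    parent = models.ForeignKey(                   # NULL for the root (trigger) node
        "self", null=True, blank=True,
        on_delete=models.CASCADE, related_name="children",
    )
    task_type = models.CharField(max_length=100)  # e.g. "slack_notify"
    params = models.JSONField(default=dict)       # task configuration
```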
Also tasks are kept independent of each other. A task consumes its input data, performs API calls, and returns results.
So they're unaware of their position in the workflow.
And a workflow engine orchestrates tasks. In other words, it decides the task execution order based on the directed rooted tree.
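Here's a sketch of how such an engine could walk the tree, assuming the ZapNode model above and a hypothetical run_task() executor:

```python
def run_workflow(node, input_data):
    # Execute this task, then feed its output to every child task.
    result = run_task(node.task_type, node.params, input_data)  # hypothetical executor
    for child in node.children.all():  # reverse relation from the model sketch above
        run_workflow(child, result)
```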
They store the session data of task execution in a dedicated MySQL instance. It's used as a key-value store with softer consistency requirements and offers low operational complexity.
Besides they use MySQL read-only replicas to serve long-running background tasks. Because the data these tasks read doesn't change often, and replication lag is tolerable.
3. Asynchronous Processing
A long-lived but idle connection to the web server consumes resources. Thus it's expensive.
So they use a message queue to avoid waiting for a request to finish. It prevents problems due to timeouts and resource bottlenecks.
They use RabbitMQ and Celery to create a distributed workflow engine. Put another way, it’s used to schedule background tasks.
RabbitMQ is a lightweight message queue.
While Celery is an asynchronous task queue based on distributed message passing. And it supports real-time operations and scheduling.
Celery sends messages to workers using RabbitMQ.
Put another way, Celery is a task management framework. It provides a high-level API to schedule and trigger tasks.
While RabbitMQ provides a low-level API to do the same things. And RabbitMQ is one of the many backends for Celery.
They send the task ID to the message queue and the worker gets the task data from the database.
And the worker executes the task.
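Here's a minimal Celery sketch of that pattern, with RabbitMQ as the broker; the task names and the DB lookup are illustrative, not Zapier's code:

```python
from celery import Celery

app = Celery("workflow_engine", broker="amqp://guest@localhost//")


@app.task
def execute_task(task_id: int):
    task = load_task_from_db(task_id)  # hypothetical DB lookup by ID
    return run_task(task)              # hypothetical executor


# The producer enqueues only the ID, so the message stays small:
# execute_task.delay(42)
```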
4. Zap History
Zap history shows a user the list of tasks that ran in their account.
They use GraphQL and Next.js API routes to get the Zap history.
While Python Django runs on the backend.
GraphQL is a data query language for APIs and a query runtime engine.
They store the results of a Zap execution in AWS S3 and emit an event to Kafka.
The emitted event contains enough information to process the execution result.
AWS S3 is an object storage service.
While Kafka is a distributed event store and stream-processing platform.
They use an indexer service to consume the events from Kafka. It also downloads the relevant data from S3.
And they index the processed Zap execution data in the Elasticsearch cluster.
Put another way, Elasticsearch stores the historical activity of Zaps.
Elasticsearch is a search engine based on the Lucene library. It offers a text search functionality with an HTTP web interface.
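Here's a sketch of what such an indexer could look like; the topic, bucket, index, and field names are assumptions for illustration:

```python
import json

import boto3
from elasticsearch import Elasticsearch
from kafka import KafkaConsumer

s3 = boto3.client("s3")
es = Elasticsearch("http://localhost:9200")
consumer = KafkaConsumer("zap-executions", bootstrap_servers="localhost:9092")

for message in consumer:
    event = json.loads(message.value)  # carries enough info to find the result in S3
    obj = s3.get_object(Bucket=event["bucket"], Key=event["key"])
    result = json.loads(obj["Body"].read())
    es.index(index="zap-history", id=event["execution_id"], document=result)
```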
5. Scalability
They use a combination of auto-scaling and auto-replacement for resilience.
Besides they scale horizontally and replicate infrastructure for high availability.
They use jitter to handle spikes in workload when many tasks get scheduled for the same time.
Put another way, they don’t guarantee that every task will run at the exact time.
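A minimal sketch of that idea, assuming an illustrative jitter window:

```python
import random
from datetime import datetime, timedelta


def schedule_with_jitter(scheduled_time: datetime, max_jitter_seconds: int = 60) -> datetime:
    # Spread tasks over a window so they don't all fire at the same instant.
    jitter = timedelta(seconds=random.uniform(0, max_jitter_seconds))
    return scheduled_time + jitter
```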
They enqueue tasks on RabbitMQ. And the tasks get consumed by workers running on Kubernetes.
Kubernetes is a container orchestration system for automating software deployment, scaling, and management.
They scale the workers based on CPU usage and the number of ready tasks in RabbitMQ. Thus allowing them to handle varying load.
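Here's an illustrative sketch of such a scaling decision; the thresholds and the tasks-per-worker ratio are assumptions, not Zapier's values:

```python
def desired_workers(current: int, cpu_percent: float, ready_tasks: int,
                    tasks_per_worker: int = 50) -> int:
    by_queue = -(-ready_tasks // tasks_per_worker)         # ceil division on queue depth
    by_cpu = current + 1 if cpu_percent > 80 else current  # add a worker under CPU pressure
    return max(1, by_queue, by_cpu)
```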
Zapier remains one of the leading automation tools in the market.
And this case study indicates that a simple tech stack with proven technologies is enough for high scalability.
Consider subscribing to get simplified case studies delivered straight to your inbox.
NK’s Recommendations
Leading Developers: If you are a Development Team Leader, Engineering Manager, or considering that career path - try this newsletter.
High Growth Engineer: Get actionable tips to grow faster in your software engineering career.
Thank you for supporting this newsletter. Consider sharing this post with your friends to get rewards. Y'all are the best.