Discover more from System Design Newsletter
How McDonald’s Food Delivery Platform Handles 20,000 Orders per Second
#44: Break Into McDonald’s Architecture (7 minutes)
This post outlines the architecture of McDonald’s food delivery platform. If you want to learn more, scroll to the bottom and find the references.
Note: This post is based on my research and may differ from real-world implementation.
March 2024 - Berlin, Germany.
Maria is taking a lunch break at the end of her shift at the hospital.
But she forgot to pack lunch that day.
So she is hungry and sad.
She decides to order food from McDonald’s through the mobile app.
They use hexagonal architecture to keep complexity low.
Imagine hexagonal architecture as a pattern that separates the core application from external services like databases and user interfaces.
The application domain lives inside the hexagon. And business logic follows domain-driven design (DDD) principles, which means it shouldn’t leak outside the hexagon.
Ports are interfaces through which the application interacts with external services, while adapters are implementations of those ports. Put simply, ports define the contract and adapters fulfill it.
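Here's a minimal sketch of ports and adapters in Python. The names `Order`, `OrderRepository`, `InMemoryOrderRepository`, and `OrderService` are hypothetical and picked only for illustration; they don't come from McDonald's codebase.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass(frozen=True)
class Order:
    order_id: str
    item: str


class OrderRepository(Protocol):
    """Port: the contract the core depends on."""

    def save(self, order: Order) -> None: ...

    def get(self, order_id: str) -> Optional[Order]: ...


class InMemoryOrderRepository:
    """Adapter: one implementation of the port (a dict stands in for a database)."""

    def __init__(self) -> None:
        self._rows: Dict[str, Order] = {}

    def save(self, order: Order) -> None:
        self._rows[order.order_id] = order

    def get(self, order_id: str) -> Optional[Order]:
        return self._rows.get(order_id)


class OrderService:
    """Core domain logic: it knows only the port, never a concrete adapter."""

    def __init__(self, repository: OrderRepository) -> None:
        self._repository = repository

    def place_order(self, order_id: str, item: str) -> Order:
        if not item:
            raise ValueError("an order needs an item")
        order = Order(order_id, item)
        self._repository.save(order)
        return order
```

Swapping the in-memory adapter for a real database adapter wouldn't touch `OrderService`, and that's the point of the pattern.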
Besides, they use event-driven architecture alongside hexagonal architecture. It provides modularity and keeps the services loosely coupled.
Imagine event-driven architecture as a pattern that uses events for communication between services. The producer publishes events to a message broker. While the consumer subscribes to those events and reacts accordingly.
So they represent the domain logic using the hexagonal core. While adapters produce and consume events.
They publish an event to the message broker when a domain event occurs. And external services subscribe to the message broker for events.
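The publish and subscribe flow described above can be sketched with a tiny in-memory broker. This is a toy stand-in for a real message broker, and `InMemoryBroker` and the topic names are assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class InMemoryBroker:
    """Toy message broker: delivers each event to the subscribers of its topic."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # The producer doesn't know who consumes the event.
        for handler in self._subscribers[topic]:
            handler(event)
```

A consumer would call `broker.subscribe("order.placed", handler)`, and the producing adapter would call `broker.publish("order.placed", {...})` whenever the domain event occurs.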
They use a schema registry to maintain well-defined contracts for events. A schema describes the expected data fields and types of an event, so the registry also helps with schema validation. Besides, they cache the event schema in services for performance.
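Here's a rough sketch of schema caching and validation. The schema format (field name mapped to a Python type) and the `fetch_schema` callable are assumptions for illustration; a real schema registry would use a format such as Avro or JSON Schema.

```python
from typing import Callable, Dict, List

Schema = Dict[str, type]  # field name -> expected Python type (assumed format)


class CachedSchemaClient:
    """Fetches a schema once and serves later lookups from a local cache."""

    def __init__(self, fetch_schema: Callable[[str], Schema]) -> None:
        self._fetch_schema = fetch_schema  # hypothetical registry lookup
        self._cache: Dict[str, Schema] = {}

    def get(self, event_type: str) -> Schema:
        if event_type not in self._cache:
            self._cache[event_type] = self._fetch_schema(event_type)
        return self._cache[event_type]


def validate(event: dict, schema: Schema) -> List[str]:
    """Return a list of problems; an empty list means the event is valid."""
    errors = []
    for field, expected in schema.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected):
            errors.append(f"wrong type for {field}")
    return errors
```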
Also, they run a standby database to prevent data loss if the message broker becomes unavailable. Events get written to the standby database during the outage and are published to the message broker once it's healthy again.
Besides they route the events to a dead-letter topic if schema validation fails. And then use a utility tool to fix those events.
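The two failure paths above can be sketched together. This is a simplified illustration, not their implementation: the `send` and `is_valid` callables, the `.dead-letter` topic suffix, and the in-memory `standby` queue (a stand-in for the standby database) are all assumptions.

```python
from collections import deque
from typing import Callable


class BrokerUnavailable(Exception):
    """Raised by the (hypothetical) broker client when it can't accept events."""


class EventPublisher:
    def __init__(self, send: Callable[[str, dict], None],
                 is_valid: Callable[[dict], bool]) -> None:
        self._send = send          # may raise BrokerUnavailable
        self._is_valid = is_valid  # schema validation result
        self.standby = deque()     # stand-in for the standby database

    def publish(self, topic: str, event: dict) -> str:
        # Invalid events go to a dead-letter topic instead of the real one.
        target = topic if self._is_valid(event) else f"{topic}.dead-letter"
        try:
            self._send(target, event)
        except BrokerUnavailable:
            self.standby.append((target, event))  # keep the event, don't lose it
            return "stored"
        return "published" if target == topic else "dead-lettered"

    def replay(self) -> int:
        """Republish stored events in order; stop at the first failure."""
        replayed = 0
        while self.standby:
            topic, event = self.standby[0]
            try:
                self._send(topic, event)
            except BrokerUnavailable:
                break
            self.standby.popleft()
            replayed += 1
        return replayed
```

An event leaves the standby queue only after the broker accepts it, so a failed replay doesn't drop anything.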
They serve around 64 million people each day across roughly 37 thousand locations.
Scalability is a difficult problem. And scalability with a distributed network at large volumes is even more difficult.
Yet they scaled their platform to 20,000 orders per second with less than 100 milliseconds of latency.
McDonald's Architecture
Here’s how McDonald’s scaled their food delivery platform:
1. Choosing Food From the Menu
Maria opens the mobile app and sees a menu with food items.
They use a reverse proxy server to host all API endpoints and route requests to microservices. Put simply, this is the API gateway pattern with REST APIs.
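Imagine the gateway's routing as a lookup from path prefix to service. Here's a minimal sketch; the paths and service names in `ROUTES` are made up for the example.

```python
from typing import Dict, Optional

# Hypothetical routing table: path prefix -> microservice name.
ROUTES: Dict[str, str] = {
    "/menu": "menu-service",
    "/orders": "order-service",
    "/loyalty": "loyalty-service",
}


def resolve(path: str, routes: Dict[str, str] = ROUTES) -> Optional[str]:
    """Return the service behind the longest matching path prefix, if any."""
    matches = [p for p in routes if path == p or path.startswith(p + "/")]
    return routes[max(matches, key=len)] if matches else None
```

A real gateway would also handle things like authentication and rate limiting, which this sketch leaves out.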
They store the menus and restaurants' working hours in an SQL database. And a serverless function queries the SQL database to create menus in JSON data format.
They show the user the menu of available food items and prices via HTTP response.
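Here's a sketch of the menu query using SQLite from Python's standard library. The table names, columns, opening hours, and prices are invented demo data; the real schema isn't described in the references.

```python
import json
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE restaurants (id INTEGER PRIMARY KEY, opens_at INTEGER, closes_at INTEGER);
        CREATE TABLE menu_items (restaurant_id INTEGER, name TEXT, price_cents INTEGER);
        INSERT INTO restaurants VALUES (1, 8, 22);
        INSERT INTO menu_items VALUES (1, 'Fries', 299), (1, 'Big Mac', 549);
        """
    )
    return conn


def build_menu(conn: sqlite3.Connection, restaurant_id: int, hour: int) -> str:
    """Return the menu as JSON; the item list is empty outside working hours."""
    rows = conn.execute(
        """
        SELECT i.name, i.price_cents
        FROM menu_items AS i
        JOIN restaurants AS r ON r.id = i.restaurant_id
        WHERE r.id = ? AND r.opens_at <= ? AND ? < r.closes_at
        ORDER BY i.name
        """,
        (restaurant_id, hour, hour),
    ).fetchall()
    items = [{"name": name, "price_cents": price} for name, price in rows]
    return json.dumps({"restaurant_id": restaurant_id, "items": items})
```

The JSON string returned by `build_menu` is what would go into the HTTP response body.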
2. Getting a Discount With Loyalty Rewards
Maria selects a Big Mac burger from the menu.
She then remembers the points she earned through past purchases.
So she applies them and gets a discount.
They publish user events into a message queue, which lets services talk to each other asynchronously. Also, a first-in-first-out (FIFO) queue preserves message order and deduplicates messages, so each transaction gets processed exactly once.
They use a serverless function to poll the message queue and then process the messages in it.
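Here's a toy version of a FIFO queue with deduplication and a polling handler. It's a simplified sketch: `FifoQueue`, `poll_once`, and the deduplication ID are assumptions for illustration, and a managed queue would also track message visibility and deletion.

```python
from collections import deque
from typing import Callable, List


class FifoQueue:
    """Keeps messages in arrival order and drops repeated deduplication IDs."""

    def __init__(self) -> None:
        self._messages = deque()
        self._seen_ids = set()

    def send(self, dedup_id: str, body: dict) -> bool:
        if dedup_id in self._seen_ids:
            return False  # duplicate: ignored so it can't be processed twice
        self._seen_ids.add(dedup_id)
        self._messages.append(body)
        return True

    def receive(self, max_messages: int = 10) -> List[dict]:
        batch = []
        while self._messages and len(batch) < max_messages:
            batch.append(self._messages.popleft())
        return batch


def poll_once(queue: FifoQueue, handler: Callable[[dict], None]) -> int:
    """What the polling function does on each run: take a batch, process in order."""
    batch = queue.receive()
    for body in batch:
        handler(body)
    return len(batch)
```

Ordering matters for loyalty points. Redeeming points before an earlier purchase gets credited could reject a valid discount.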
They use serverless to add new features without worrying about infrastructure. Also it helps them to reduce costs by avoiding server provisioning for maximum capacity.
3. Ordering Food
Maria places the order.
They create a WebSocket connection for bidirectional communication with the client. And the client sends an event to the API gateway when a new order gets created.
The API gateway then forwards that event to an event bus. It has a routing system that can identify new orders. Think of the event bus as a central hub for messages between services.
The event bus triggers a serverless function to check whether the ordered food is available at the restaurant. Put simply, it validates if a specific order could be completed.
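The routing step can be sketched as rules that match on event fields. `EventBus`, the `type` field, and `can_fulfil` are hypothetical names used only for this example.

```python
from typing import Callable, Dict, List, Tuple


class EventBus:
    """Toy event bus: a rule fires when every field in its pattern matches."""

    def __init__(self) -> None:
        self._rules: List[Tuple[dict, Callable[[dict], None]]] = []

    def add_rule(self, pattern: dict, target: Callable[[dict], None]) -> None:
        self._rules.append((pattern, target))

    def put(self, event: dict) -> int:
        matched = 0
        for pattern, target in self._rules:
            if all(event.get(key) == value for key, value in pattern.items()):
                target(event)
                matched += 1
        return matched


def can_fulfil(stock: Dict[str, int], event: dict) -> bool:
    """Validation step: is every ordered item available in the needed quantity?"""
    return all(stock.get(item, 0) >= qty for item, qty in event["items"].items())
```

A rule such as `bus.add_rule({"type": "order.created"}, handler)` is how the bus would identify new orders and trigger the validation function.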
Afterward they send an event to the client when the restaurant accepts the order.
They use an in-memory cache with Redis to process orders. It provides high performance and low latency. Also they back up Redis in a relational database for durability in case of an outage.
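Here's one way the cache plus backup could look. It's a sketch under assumptions: a plain dict stands in for Redis, SQLite stands in for the relational database, and the write-through strategy is my guess since the references don't say how the backup gets written.

```python
import json
import sqlite3
from typing import Dict, Optional


class OrderStore:
    """Write-through store: every write goes to the database and the cache."""

    def __init__(self, db: sqlite3.Connection) -> None:
        self._cache: Dict[str, dict] = {}  # stand-in for Redis
        self._db = db                      # stand-in for the relational database
        db.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT PRIMARY KEY, body TEXT NOT NULL)")

    def put(self, order_id: str, order: dict) -> None:
        self._db.execute(
            "INSERT OR REPLACE INTO orders (id, body) VALUES (?, ?)",
            (order_id, json.dumps(order)),
        )
        self._db.commit()
        self._cache[order_id] = order

    def get(self, order_id: str) -> Optional[dict]:
        if order_id in self._cache:
            return self._cache[order_id]  # fast path
        row = self._db.execute("SELECT body FROM orders WHERE id = ?", (order_id,)).fetchone()
        if row is None:
            return None
        order = json.loads(row[0])
        self._cache[order_id] = order     # repopulate after a cache outage
        return order

    def drop_cache(self) -> None:
        """Simulate a cache outage."""
        self._cache.clear()
```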
Meanwhile, the serverless function that validated the order waits for a callback. The callback is used to notify the user when the food is ready to be picked up by the delivery driver. Besides, they store the serverless function’s callback task token in a key-value database.
They use the estimated time of arrival (ETA) and food preparation time to notify the driver of a pick-up. It helps to avoid extra waiting time.
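A simple way to think about the timing: notify the driver late enough that they arrive just as the food is ready. The formula below is my own simplification and isn't taken from the references.

```python
def notify_delay_minutes(prep_minutes: float, driver_eta_minutes: float) -> float:
    """How long to wait before notifying the driver.

    If the food takes longer than the drive, hold the notification for the
    difference; otherwise notify right away.
    """
    return max(0.0, prep_minutes - driver_eta_minutes)
```

For example, with 12 minutes of preparation and a 5-minute drive, the driver gets notified after 7 minutes.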
They use a separate serverless function to trigger an event as the driver gets near the restaurant. The event bus identifies whether it’s an event for an existing order and, if so, queries the key-value database for the callback task token.
And then it forwards that event to the waiting serverless function. The waiting serverless function will then release the order. Also it notifies the user that the driver has picked up their food.
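The wait-for-callback flow can be sketched like this. `PickupWorkflow` and `on_driver_nearby` are hypothetical names, a dict stands in for the key-value database, and the paused function is modeled as an entry in a table instead of a real suspended execution.

```python
import uuid
from typing import Dict, List


class PickupWorkflow:
    """Models paused order workflows that resume when their token comes back."""

    def __init__(self, token_store: Dict[str, str]) -> None:
        self._tokens = token_store           # order_id -> task token (key-value database stand-in)
        self._waiting: Dict[str, str] = {}   # task token -> order_id (paused workflows)
        self.notifications: List[str] = []

    def wait_for_driver(self, order_id: str) -> str:
        token = uuid.uuid4().hex
        self._tokens[order_id] = token
        self._waiting[token] = order_id
        return token

    def resume(self, token: str) -> bool:
        order_id = self._waiting.pop(token, None)
        if order_id is None:
            return False  # unknown token or already resumed
        self.notifications.append(f"order {order_id}: picked up by driver")
        return True


def on_driver_nearby(event: dict, token_store: Dict[str, str], workflow: PickupWorkflow) -> bool:
    """Look up the token for the order and resume the waiting workflow."""
    token = token_store.get(event["order_id"])
    if token is None:
        return False  # not an event for an existing order
    return workflow.resume(token)
```

The task token is the only link between the driver event and the paused workflow, which is why it needs to live in a database both sides can reach.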
4. Giving Feedback
Maria is happy with the burger she had.
Later, she receives a feedback survey notification from McDonald’s.
They need data about user experience to improve their service.
So they run surveys and collect social media comments. And they use extract-transform-load (ETL) to prepare the information for analysis. The data then gets stored in S3 object storage.
Afterward they send the data to a natural language processing (NLP) service for sentiment analysis. Also, the processed data is sent to in-house models to improve accuracy. Finally, the result data gets stored in a data warehouse to create operational reports.
Their food delivery platform runs microservices to support more features. And uses a load balancer to distribute the traffic across microservices.
Yet each microservice has a different scale and runtime profile. For example, customer-facing services are considered critical, while background processing services can tolerate failures.
They moved services that change often into separate microservices. So they could deploy and iterate faster.
They do smoke testing to check the responsiveness of their API. Imagine smoke testing as a quick and basic test to check functionality.
Besides they use circuit breaker logic and exponential backoff for resilience. And focus on automation to reduce operational efforts.
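Here's a minimal sketch of both resilience techniques. The thresholds, delays, and class names are assumptions for illustration and not McDonald's settings.

```python
import time
from typing import Callable, List, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], object]) -> object:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit is open")
            self._opened_at = None  # half-open: let one trial call through
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # stop calling the failing service
            raise
        self._failures = 0
        return result


def backoff_delays(base: float = 0.1, factor: float = 2.0,
                   retries: int = 4, cap: float = 5.0) -> List[float]:
    """Exponential backoff: each retry waits `factor` times longer, up to `cap`."""
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]
```

The circuit breaker protects a struggling service from more traffic, while backoff spaces out the retries. In practice, some random jitter often gets added to the delays so clients don't retry in lockstep.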
McDonald's remains one of the world's largest fast-food restaurant chains.
References
Delivering agility at McDonald's with microservices transformation
Enhancing Loyalty rewards: How McDonald’s leverages AWS Lambda for microservices
McDonald’s event-driven architecture: The data journey and how it works
Taco Bell: Order Middleware - Enabling Delivery Orders at Massive Scale
FoodHub: Enabling Massive Scale Order Processing with Serverless Architecture
Taco Bell: Aurora as The Heart of the Menu Middleware and Data Integration Platform for Taco Bell