Slack Architecture That Powers Billions of Messages a Day
#18: Read Now - Awesome Slack Architecture (4 minutes)
This post outlines Slack's architecture. If you want to learn more, scroll to the bottom and find the references.
Slack is a real-time messaging app.
They built Slack with 2 API types: web and real-time.
The web API handles user sessions over HTTP.
The real-time API handles chat messages, typing indicators, and presence status. It uses WebSockets for bidirectional communication.
I think HTTP/2 combined with Server-Sent Events is another option. Server-Sent Events only flow from the server to the client, so the client would send its own updates as ordinary HTTP requests. HTTP/2 makes this practical because it multiplexes those requests and the event stream over the same TCP connection.
Slack also paginates its API responses to reduce latency and bandwidth usage, and it does so with cursor-based pagination.
Cursor-based pagination works by maintaining a pointer to a specific item in an ordered dataset. Each client request includes the pointer, so the server returns only the items after it.
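Here is a minimal sketch of that idea, not Slack's actual API. The MESSAGES list, fetch_page, and fetch_all are hypothetical names, and the cursor is a plain integer id to keep things short (real APIs usually hand out an opaque string instead).

```python
from typing import Optional

# Hypothetical ordered dataset: message ids sorted in ascending order.
MESSAGES = [{"id": i, "text": f"message {i}"} for i in range(1, 8)]


def fetch_page(cursor: Optional[int], limit: int = 3) -> dict:
    """Return up to `limit` items whose id comes after `cursor`."""
    after = cursor if cursor is not None else 0
    items = [m for m in MESSAGES if m["id"] > after][:limit]
    # next_cursor is None once the last item of the dataset has been returned.
    has_more = bool(items) and items[-1]["id"] < MESSAGES[-1]["id"]
    next_cursor = items[-1]["id"] if has_more else None
    return {"items": items, "next_cursor": next_cursor}


def fetch_all() -> list:
    """Client loop: keep passing the cursor back until the server returns None."""
    cursor, collected = None, []
    while True:
        page = fetch_page(cursor)
        collected.extend(page["items"])
        cursor = page["next_cursor"]
        if cursor is None:
            return collected
```

The client never asks for "page 3". It only says "give me what comes after this item", so items inserted earlier in the dataset don't shift the pages it has yet to read.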
Slack runs on the LAMP (Linux-Apache-MySQL-PHP) stack.
They used a MySQL database to store chat messages because:
It’s a proven technology
It offers mature tooling support
There are many experienced engineers available in the SQL domain
The relational data model is a great discipline
But MySQL favors strong consistency by default. So they configured it as an eventually consistent database to get high availability.
The Slack desktop client was built with ElectronJS and ReactJS.
Messaging Architecture
A messaging app needs 3 things to work:
Validity: every user gets the published message
Integrity: a chat message doesn’t get delivered to a user more than once
Total order: chat messages get delivered in the same order for every user
This is similar to atomic broadcast in distributed systems. Atomic broadcast is equivalent to consensus, and it's impossible to guarantee in an asynchronous system where nodes can fail.
So they loosened some constraints to create Slack. They did it by relaxing the end-to-end property of the system based on usage patterns.
They built Slack with a client-server architecture.
The chat server is a PHP monolith that does CRUD operations on the chat database.
The gateway server is a stateful in-memory service. It pushes chat messages to the client over WebSockets.
Consistent hashing maps Slack channels to gateway servers.
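A minimal sketch of a consistent hash ring, assuming SHA-256 as the hash and 100 virtual nodes per server. The HashRing class, the server names, and the channel ids are all hypothetical, and the source doesn't describe how Slack implements its ring.

```python
import bisect
import hashlib


class HashRing:
    """Maps keys (channel ids) to servers placed on a hash ring."""

    def __init__(self, servers, vnodes=100):
        # Each server is placed on the ring many times (virtual nodes)
        # so that channels spread evenly across servers.
        self._ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def lookup(self, channel_id: str) -> str:
        # Walk clockwise to the first virtual node after the key's hash,
        # wrapping around to the start of the ring if needed.
        idx = bisect.bisect_right(self._points, self._hash(channel_id))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["gateway-1", "gateway-2", "gateway-3"])
server = ring.lookup("C024BE91L")  # the same channel id always maps to the same server
```

The benefit over a plain "hash modulo server count" is that removing a server only moves the channels that lived on it. Every other channel stays on its current gateway server.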
Vitess shards MySQL with channel-id as the shard key to reduce conflicts.
Vitess is a database clustering system for MySQL. It manages the shard topology and makes it easier to scale out.
They use Consul as the service registry. A service registry lets services find each other and communicate.
The job queue defers non-critical tasks like indexing chat messages in search. They built their own job queue instead of using a third-party solution like Kafka because they wanted to meet their needs with little operational complexity.
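To show what "deferring" means, here is a small in-process stand-in built on Python's standard queue module. It's only an illustration and says nothing about how Slack's own job queue works. The names post_message, index_message, and start_worker are hypothetical.

```python
import queue
import threading

jobs = queue.Queue()   # stand-in for the job queue
search_index = {}      # word -> set of message ids


def index_message(msg_id: str, text: str) -> None:
    """Non-critical work: add the message to a toy search index."""
    for word in text.lower().split():
        search_index.setdefault(word, set()).add(msg_id)


def worker() -> None:
    """Runs deferred jobs until it receives the None sentinel."""
    while True:
        job = jobs.get()
        try:
            if job is None:
                return
            func, args = job
            func(*args)
        finally:
            jobs.task_done()


def start_worker() -> threading.Thread:
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def post_message(msg_id: str, text: str, store: dict) -> None:
    store[msg_id] = text                          # critical path: persist the message
    jobs.put((index_message, (msg_id, text)))     # deferred: index it later
```

post_message returns as soon as the message is stored. The indexing happens whenever the worker gets to it, so a slow search index never delays message delivery.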
They terminate SSL at the Envoy edge proxy. Envoy also supports hot restarts, which help achieve high availability.
A hot restart avoids dropping client connections when a code change gets deployed.
Yet Slack's initial payload size increased as the number of users grew, which resulted in high latency. So they created a new service (the snapshot service) to get low latency and high performance. It's an application-level edge query engine backed by a cache server.
The snapshot service provides just-in-time annotation. It predicts the data objects that the client might query next and pushes them proactively to prevent an extra network call.
Mobile users reply to a chat message using the web API. This lets them avoid the extra work of creating a WebSocket connection.
They salted the chat messages to prevent the same message from being shown more than once. The salt is a unique, random token.
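One way to picture salting is as an idempotency token: the sender attaches a random token to each message, and the receiver ignores any token it has already seen. The sketch below assumes exactly that. ChannelLog and new_salt are hypothetical names, and the source doesn't say where Slack performs the check.

```python
import uuid


def new_salt() -> str:
    """Generate a unique, random token for one logical message."""
    return uuid.uuid4().hex


class ChannelLog:
    """Keeps each salted message at most once, even if it is sent again."""

    def __init__(self):
        self.messages = []
        self._seen_salts = set()

    def append(self, salt: str, text: str) -> bool:
        if salt in self._seen_salts:
            return False  # duplicate delivery (for example, a client retry)
        self._seen_salts.add(salt)
        self.messages.append(text)
        return True
```

A retry reuses the same salt, so it gets dropped. Two separate messages with identical text get different salts, so both are kept.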
They also installed load balancers between different system components and stored the media files shared in chat messages in AWS S3. Frequently accessed media assets get cached in a CDN to reduce latency.
They added logic to fetch the newer chat messages from the server using the last-seen timestamp of the user.
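A minimal sketch of that catch-up logic, assuming the channel history is kept sorted by timestamp. HISTORY and messages_since are hypothetical names, and the timestamps are made up for illustration.

```python
import bisect

# Hypothetical channel history as (timestamp, text), sorted by timestamp.
HISTORY = [(100.0, "a"), (101.5, "b"), (103.0, "c")]


def messages_since(history, last_seen_ts):
    """Return only the messages strictly newer than the client's last-seen timestamp."""
    timestamps = [ts for ts, _ in history]
    start = bisect.bisect_right(timestamps, last_seen_ts)
    return history[start:]
```

A client that reconnects after seeing message "b" would call messages_since(HISTORY, 101.5) and receive only message "c", instead of downloading the whole channel again.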
The logical clock (vector clock) preserves the ordering of chat messages.
A logical clock captures the causal relationships between events in a distributed system. It does so with a counter that gets incremented on every chat message. A vector clock keeps one such counter per node.
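The counter described above can be sketched as a Lamport-style logical clock, which is the single-counter version of the idea. This is a simplification I'm assuming for illustration, not Slack's implementation, and LamportClock is a hypothetical name.

```python
class LamportClock:
    """A single-counter logical clock."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Call when this node sends a chat message; returns the message's timestamp."""
        self.time += 1
        return self.time

    def receive(self, remote_time: int) -> int:
        """Call when a message arrives; jump ahead of the sender's timestamp."""
        self.time = max(self.time, remote_time) + 1
        return self.time


alice, bob = LamportClock(), LamportClock()
t1 = alice.tick()        # Alice sends a message
t2 = bob.receive(t1)     # Bob receives it
t3 = bob.tick()          # Bob replies
t4 = alice.receive(t3)   # Alice receives the reply
```

Because a receiver always moves its counter past the sender's, a reply always carries a larger timestamp than the message it responds to, whatever the wall clocks on each machine say.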
The Thrift data serialization format also gave them high performance.
Slack App Workflow
I'll summarize the Slack messaging workflow. The client delivers a message to the chat server. The chat server then requests the job queue to index the message in search. The message then gets routed to the gateway server using consistent hashing.
The gateway server also keeps an on-disk buffer of uncommitted sends. It helps the server recover from crashes and keeps core Slack always operational.
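Here is a rough sketch of how such a buffer could work, assuming an append-only file with one JSON entry per line. The SendBuffer class and the file layout are my assumptions. The source doesn't describe the real format.

```python
import json
import os


class SendBuffer:
    """Append-only on-disk log of sends that haven't been committed yet."""

    def __init__(self, path: str):
        self.path = path

    def _append(self, entry: dict) -> None:
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
            f.flush()
            os.fsync(f.fileno())  # make sure the entry survives a crash

    def record(self, msg_id: str, text: str) -> None:
        """Write the send to disk before trying to deliver it."""
        self._append({"op": "send", "id": msg_id, "text": text})

    def commit(self, msg_id: str) -> None:
        """Mark the send as safely stored by the chat server."""
        self._append({"op": "commit", "id": msg_id})

    def pending(self) -> list:
        """After a restart, replay the log to find sends that never committed."""
        if not os.path.exists(self.path):
            return []
        sends = {}
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                entry = json.loads(line)
                if entry["op"] == "send":
                    sends[entry["id"]] = entry["text"]
                else:
                    sends.pop(entry["id"], None)
        return list(sends.items())
```

If the process crashes between record and commit, a fresh SendBuffer pointed at the same file still finds the message in pending() and can retry it.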
Slack grew to support billions of messages a day and runs 5 million simultaneous sessions at peak.
My takeaways from this case study are:
Optimality changes with growth and it’s important to find the end-to-end part of the problem
Complexity isn’t bad if it solves a problem
References
https://gotoams.nl/2018/sessions/440/scaling-slack
https://systemdesign.one/slack-architecture/
https://slack.engineering/real-time-messaging/