The System Design Newsletter

Slack Architecture That Powers Billions of Messages a Day

#18: Read Now - Awesome Slack Architecture (4 minutes)

Neo Kim
Oct 26, 2023

This post outlines Slack's architecture. If you want to learn more, scroll to the bottom and find the references.

Slack is a real-time messaging app.

They built Slack with 2 API types: web and real-time.

The web API handles user sessions over HTTP.

The real-time API handles chat messages, typing indicators, and presence status. It uses WebSockets for bidirectional communication.

[Figure: Real-Time API with WebSockets]

I think HTTP/2 combined with Server-Sent Events is another option. Server-Sent Events only flow from the server to the client, so the client would send its own messages as regular HTTP requests. And HTTP/2 multiplexes both over the same TCP connection.

Also the Slack API is paginated to reduce latency and bandwidth usage. It uses cursor-based pagination.

Cursor-based pagination works by maintaining a pointer to a specific item in an ordered dataset. Each client request includes the pointer to get only the items after it.
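
Here is a minimal sketch of cursor-based pagination to make the idea concrete. It is not Slack's implementation: the in-memory `MESSAGES` list, the `list_messages` function, and the `next_cursor` field name are all assumptions made for illustration.

```python
import base64
import json
from typing import Optional

# Hypothetical in-memory dataset, ordered by message id.
MESSAGES = [{"id": i, "text": f"message {i}"} for i in range(1, 8)]


def encode_cursor(last_id: int) -> str:
    """Pack the last-seen id into an opaque cursor string."""
    raw = json.dumps({"last_id": last_id}).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor: str) -> int:
    """Recover the last-seen id from a cursor string."""
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))["last_id"]


def list_messages(cursor: Optional[str] = None, limit: int = 3) -> dict:
    """Return up to `limit` items after the cursor, plus the next cursor."""
    last_id = decode_cursor(cursor) if cursor else 0
    page = [m for m in MESSAGES if m["id"] > last_id][:limit]
    # An empty next_cursor tells the client there is nothing left to fetch.
    has_more = bool(page) and page[-1]["id"] < MESSAGES[-1]["id"]
    next_cursor = encode_cursor(page[-1]["id"]) if has_more else ""
    return {"items": page, "next_cursor": next_cursor}
```

The client passes back the `next_cursor` it got from the previous response. The server never has to count or skip rows, unlike offset-based pagination.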

Slack runs on the LAMP (Linux-Apache-MySQL-PHP) stack.

[Figure: Slack Tech Stack]

They used MySQL to store chat messages because:

  • It’s a proven technology

  • It offers mature tooling support

  • There are many experienced engineers available in the SQL domain

  • The relational data model enforces good discipline

But MySQL favors strong consistency by default. So they configured MySQL as an eventually consistent database to get high availability.

Also the Slack desktop client is built with Electron and React.

Messaging Architecture

A messaging app needs 3 things to work:

  • Validity: every user gets the published message

  • Integrity: a chat message doesn’t get delivered to a user more than once

  • Total order: chat messages get delivered in the same order for every user

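These are the properties of atomic broadcast in distributed systems. Atomic broadcast is equivalent to consensus, which can't be guaranteed in a fully asynchronous system where even a single process may crash. So it's impossible to implement perfectly.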

So they loosened some constraints to create Slack. They did it by relaxing the end-to-end property of the system based on usage patterns.

They built Slack with a client-server architecture.

[Figure: Slack Architecture]

The chat server is a PHP monolith that does CRUD operations on the chat database.

The gateway server is a stateful in-memory service. It pushes chat messages to the client over WebSockets.

Also consistent hashing maps Slack channels to gateway servers.
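
The sketch below shows one common way to do consistent hashing: a hash ring with virtual nodes. It is an illustration under assumptions, not Slack's code. The `HashRing` class, the `gw-1` style server names, and the choice of SHA-256 are all hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Maps channel ids to gateway servers on a hash ring."""

    def __init__(self, servers, vnodes: int = 100):
        self._ring = []  # sorted list of (hash, server) points
        for server in servers:
            self.add(server, vnodes)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def add(self, server: str, vnodes: int = 100) -> None:
        # Each server gets many points on the ring to spread the load evenly.
        for i in range(vnodes):
            bisect.insort(self._ring, (self._hash(f"{server}#{i}"), server))

    def remove(self, server: str) -> None:
        self._ring = [(h, s) for h, s in self._ring if s != server]

    def lookup(self, channel_id: str) -> str:
        # Walk clockwise to the first point at or after the channel's hash.
        idx = bisect.bisect_left(self._ring, (self._hash(channel_id), ""))
        if idx == len(self._ring):
            idx = 0  # wrap around the ring
        return self._ring[idx][1]
```

The useful property is that removing a gateway server only remaps the channels that lived on it. Channels on the other servers stay where they are.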

Vitess shards MySQL with channel-id as the shard key to reduce conflicts.

Vitess is a database clustering system for MySQL. It handles sharding and topology management, and helps to scale out easily.

They use Consul as the service registry. A service registry lets services find each other and communicate.

The job queue defers non-critical tasks like indexing chat messages in search. They created their own job queue instead of using a third-party solution like Kafka because they wanted to meet their needs with little operational complexity.
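
A tiny sketch of the idea behind deferring work to a job queue. Everything here is hypothetical (the `JobQueue` class, `post_message`, and the toy `search_index`); a real job queue would be durable and run workers in separate processes.

```python
from collections import deque

search_index = {}  # toy inverted index: word -> set of message ids


def index_message(message_id: int, text: str) -> None:
    """Non-critical task: make the message searchable."""
    for word in text.lower().split():
        search_index.setdefault(word, set()).add(message_id)


class JobQueue:
    """In-memory FIFO queue of deferred jobs."""

    def __init__(self):
        self._jobs = deque()

    def enqueue(self, func, *args) -> None:
        self._jobs.append((func, args))

    def run_pending(self) -> int:
        """Run every queued job in order and return how many ran."""
        done = 0
        while self._jobs:
            func, args = self._jobs.popleft()
            func(*args)
            done += 1
        return done


def post_message(queue: JobQueue, store: dict, message_id: int, text: str) -> None:
    store[message_id] = text  # critical path: persist the message
    queue.enqueue(index_message, message_id, text)  # deferred: index later
```

The message is stored right away, while indexing waits until a worker calls `run_pending`. So a slow search index never blocks a send.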

They do SSL termination with the Envoy edge proxy. It also supports hot restarts for high availability.

A hot restart replaces the running proxy process, for example on a code deployment, without dropping client connections.

Yet Slack’s initial payload size grew with the number of users and resulted in high latency. So they created a new service (the snapshot service) to get low latency and high performance. It’s an application-level edge query engine backed by a cache server.

The snapshot service provides just-in-time annotation. It predicts the data objects the client might query next and pushes them proactively to avoid an extra network call.

Also mobile users reply to a chat message through the web API. This saves them the extra work of creating a WebSocket connection.

They salted the chat messages to prevent the same message from being shown more than once. The salt is a unique random token.
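
A minimal sketch of how a salt can filter out duplicates. The `new_message` helper and the `Deduplicator` class are assumptions for illustration, and where Slack performs this check is not something this sketch claims to know.

```python
import uuid


def new_message(text: str) -> dict:
    """Attach a unique random token (the salt) when the message is created."""
    return {"salt": uuid.uuid4().hex, "text": text}


class Deduplicator:
    """Remembers the salts it has seen and rejects repeats."""

    def __init__(self):
        self._seen = set()

    def accept(self, message: dict) -> bool:
        """Return True the first time a salt shows up, False afterwards."""
        if message["salt"] in self._seen:
            return False
        self._seen.add(message["salt"])
        return True
```

A retry resends the same salt, so it gets dropped. Two different messages with the same text get different salts, so both get shown.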

Also they installed load balancers between the system components and stored the media files shared in chat messages in AWS S3. Frequently accessed media assets get cached in a CDN to reduce latency.

They added logic to fetch the newer chat messages from the server using the last-seen timestamp of the user.
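
The catch-up logic can be sketched like this. The `messages_since` function and the `(timestamp, text)` tuple layout are hypothetical; the only assumption that matters is that the history is sorted by timestamp.

```python
from bisect import bisect_right


def messages_since(history, last_seen_ts):
    """Return the messages strictly newer than `last_seen_ts`.

    `history` is a list of (timestamp, text) tuples sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    # bisect_right skips every message at or before the last-seen timestamp.
    return history[bisect_right(timestamps, last_seen_ts):]
```

A client that reconnects sends its last-seen timestamp and gets only the messages it missed.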

A logical clock (vector clock) preserves the ordering of chat messages.

A logical clock tracks the causal relationships between events in a distributed system. It does this with a counter that gets incremented on every chat message.
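
To show the counter idea, here is a sketch of a Lamport clock, the simplest kind of logical clock. It is a stand-in and not Slack's implementation: a vector clock would keep one such counter per node instead of a single number. The `LamportClock` class name is hypothetical.

```python
class LamportClock:
    """A single counter that moves forward on every event."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Local event, such as sending a chat message."""
        self.time += 1
        return self.time

    def receive(self, remote_time: int) -> int:
        """Merge the timestamp carried by an incoming message."""
        self.time = max(self.time, remote_time) + 1
        return self.time
```

If one message causes another (say, a reply), the reply always carries a larger timestamp. So sorting by timestamp never shows a reply before the message it answers.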

They also used the Thrift serialization format for high performance.

Slack App Workflow

[Figure: Messaging Workflow]

I'll summarize the Slack messaging workflow. The client delivers a message to the chat server. The chat server then requests the job queue to index the message in search. The message then gets routed to the gateway server using consistent hashing.

Also the gateway server keeps an on-disk buffer of uncommitted sends. It helps to recover from crashes and keeps the core of Slack operational.
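
A rough sketch of what an on-disk buffer of uncommitted sends could look like. The `SendBuffer` class, the JSON-lines file format, and the temporary file location are all assumptions, not details from Slack.

```python
import json
import os
import tempfile


class SendBuffer:
    """Appends each send to a file so it survives a process crash."""

    def __init__(self, path=None):
        # Hypothetical default location; a real service would use a fixed path.
        self.path = path or os.path.join(tempfile.mkdtemp(), "sends.jsonl")

    def append(self, message: dict) -> None:
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(message) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the write to disk before returning

    def pending(self) -> list:
        """Sends that were written but not yet committed."""
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def commit_all(self) -> None:
        """Clear the buffer once the sends are safely stored elsewhere."""
        open(self.path, "w", encoding="utf-8").close()
```

After a crash, a new process opens the same file and replays whatever `pending` returns.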


Slack grew to support billions of messages a day and 5 million simultaneous sessions at peak.

My takeaways from this case study are:

  • Optimality changes with growth and it’s important to find the end-to-end part of the problem

  • Complexity isn’t bad if it solves a problem


References

  • https://gotoams.nl/2018/sessions/440/scaling-slack

  • https://systemdesign.one/slack-architecture/

  • https://slack.engineering/real-time-messaging/

  • Photo by Scott Webb on Unsplash
