System Design Newsletter

Slack Architecture That Powers Billions of Messages a Day

#18: Read Now - Awesome Slack Architecture (4 minutes)

NK
Oct 26, 2023

Slack is a real-time messaging app.

They built Slack with 2 API types: web and real-time.

The web API handles user sessions over HTTP, while the real-time API handles chat messages, typing indicators, and presence status. The real-time API uses WebSockets for bidirectional communication.

Figure: Real-Time API with WebSockets

I think HTTP/2 combined with Server-Sent Events is another option. Server-Sent Events only push data from the server to the client, so the client would send its own messages as regular HTTP requests. HTTP/2 multiplexes both directions over the same TCP connection.

The Slack API is also paginated to reduce latency and bandwidth usage, and it uses cursor-based pagination to do so.

Cursor-based pagination works by maintaining a pointer to a specific item in an ordered dataset. Each client request includes the pointer, and the server returns only the items after it.
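Here is a minimal sketch of the idea in Python. It is not Slack's implementation: the names `fetch_page`, `fetch_all`, and `Page` are hypothetical, and it assumes the dataset is a sorted list of unique message IDs where the cursor is the last ID the client has seen.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class Page(NamedTuple):
    items: list[int]
    next_cursor: Optional[int]  # None means there are no more items


def fetch_page(message_ids: list[int], cursor: Optional[int], limit: int) -> Page:
    """Return up to `limit` IDs that come strictly after `cursor`.

    `message_ids` must be sorted ascending; the cursor is the last ID seen.
    """
    start = 0 if cursor is None else bisect_right(message_ids, cursor)
    items = message_ids[start:start + limit]
    has_more = start + limit < len(message_ids)
    return Page(items, items[-1] if items and has_more else None)


def fetch_all(message_ids: list[int], limit: int) -> list[int]:
    """Walk every page the way a client would, following next_cursor."""
    collected: list[int] = []
    cursor: Optional[int] = None
    while True:
        page = fetch_page(message_ids, cursor, limit)
        collected.extend(page.items)
        if page.next_cursor is None:
            return collected
        cursor = page.next_cursor
```

Unlike offset-based pagination, the cursor stays valid when new items get appended, because the server looks up a position by value instead of counting rows.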

Slack runs on the LAMP (Linux-Apache-MySQL-PHP) stack.

Figure: Slack Tech Stack

They used MySQL to store chat messages because:

  • It’s a proven technology

  • It offers mature tooling support

  • There are many experienced engineers available in the SQL domain

  • The relational data model enforces useful discipline

But MySQL favors strong consistency by default, so they set it up as an eventually consistent database to get high availability.

The Slack desktop client was built with ElectronJS and ReactJS.

Messaging Architecture

A messaging app needs 3 things to work:

  • Validity: every user gets the published message

  • Integrity: a chat message doesn’t get delivered to a user more than once

  • Total order: chat messages get delivered in the same order for every user

This is similar to atomic broadcast in distributed systems. Atomic broadcast is equivalent to consensus, which can't be guaranteed in a fully asynchronous system where a process may fail.

So they loosened some constraints to create Slack. They did it by relaxing the end-to-end property of the system based on usage patterns.

They built Slack with a client-server architecture.

Figure: Slack Architecture

The chat server is a PHP monolith that does CRUD operations on the chat database.

The gateway server is a stateful in-memory service. It pushes chat messages to the client over WebSockets.

Consistent hashing maps Slack channels to gateway servers.
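A rough sketch of how such a mapping could look in Python. The `HashRing` class, the server names, and the use of virtual nodes with SHA-256 are assumptions for illustration; the post doesn't say which hash function or ring layout Slack uses.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes for a smoother spread."""

    def __init__(self, servers: list[str], vnodes: int = 100) -> None:
        self._ring: list[tuple[int, str]] = []
        for server in servers:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        # A stable hash; Python's built-in hash() is randomized per process.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def server_for(self, channel_id: str) -> str:
        """Return the first server clockwise from the channel's hash."""
        idx = bisect.bisect_right(self._points, self._hash(channel_id))
        return self._ring[idx % len(self._ring)][1]
```

The useful property is that removing a gateway server only remaps the channels that lived on it. Channels on the remaining servers keep their assignment, so most WebSocket subscriptions stay where they are.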

Vitess shards MySQL with channel-id as the shard key to reduce conflicts.

Vitess is a database clustering system that handles sharding and topology management for MySQL, which makes it easier to scale out.

They use Consul as the service registry. A service registry lets services find each other and communicate.

The job queue defers non-critical tasks like indexing chat messages in search. They built their own job queue instead of using a third-party solution like Kafka because they wanted to meet their needs with little operational complexity.
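To make the idea of deferring work concrete, here is a toy in-process sketch in Python. It is only an illustration of the pattern, not Slack's job queue: `post_message`, `index_worker`, and the dictionary-based `search_index` are hypothetical stand-ins for the chat server, a queue worker, and the search backend.

```python
import queue
import threading

search_index: dict[str, set[int]] = {}  # hypothetical inverted index: word -> message IDs
jobs: queue.Queue = queue.Queue()


def index_worker() -> None:
    """Drain jobs in the background until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            return
        message_id, text = job
        for word in text.lower().split():
            search_index.setdefault(word, set()).add(message_id)
        jobs.task_done()


def post_message(store: dict[int, str], message_id: int, text: str) -> None:
    """Persist the message first, then defer the non-critical indexing."""
    store[message_id] = text      # critical path: the write the user waits on
    jobs.put((message_id, text))  # deferred: the worker indexes it later
```

The user-facing write returns as soon as the message is stored. Search indexing happens whenever the worker gets to it, so a slow or unavailable search backend doesn't block sending messages.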

They terminate SSL at the Envoy edge proxy, which also supports hot restarts for high availability.

A hot restart lets a new proxy process take over without dropping client connections when a change gets deployed.

Yet Slack’s initial payload size increased as the number of users grew, which resulted in high latency. So they created a new service (the snapshot service) to get low latency and high performance. It’s an application-level edge query engine backed by a cache server.

The snapshot service provides just-in-time annotation. It predicts the data objects that the client might query next and pushes them proactively to prevent an extra network call.

Mobile users also reply to a chat message using the web API. This lets them avoid the extra work of creating a WebSocket connection.

They salted the chat messages to prevent the same message from being shown more than once. The salt is a unique random token.
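A small Python sketch of how a salt can be used for deduplication. The `MessageView` class and the choice of a UUID as the token are assumptions; the post only says the salt is unique and random.

```python
import uuid


def new_salt() -> str:
    """Generate a unique random token for one logical message."""
    return uuid.uuid4().hex


class MessageView:
    """Client-side list that ignores messages whose salt was already shown."""

    def __init__(self) -> None:
        self._seen_salts: set[str] = set()
        self.shown: list[str] = []

    def receive(self, salt: str, text: str) -> bool:
        """Show the message unless its salt is a repeat; return True if shown."""
        if salt in self._seen_salts:
            return False
        self._seen_salts.add(salt)
        self.shown.append(text)
        return True
```

The sender attaches the same salt to every retry of a message. So a retry after a timeout gets dropped as a duplicate, while two separate messages with identical text still show up twice because their salts differ.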

They also placed load balancers between the different system components and stored the media files shared in chat messages in AWS S3. Frequently accessed media assets get cached in a CDN to reduce latency.

They added logic to fetch the newer chat messages from the server using the last-seen timestamp of the user.
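A minimal sketch of that catch-up logic in Python, assuming the server assigns each message an increasing timestamp and returns the channel history sorted by it. `ChannelClient` and `Message` are hypothetical names.

```python
from typing import NamedTuple


class Message(NamedTuple):
    ts: float  # server-assigned timestamp, assumed to increase per channel
    text: str


class ChannelClient:
    """Tracks the last-seen timestamp and pulls only newer messages."""

    def __init__(self) -> None:
        self.last_seen_ts = 0.0
        self.messages: list[Message] = []

    def sync(self, server_history: list[Message]) -> int:
        """Append messages newer than last_seen_ts; return how many arrived."""
        newer = [m for m in server_history if m.ts > self.last_seen_ts]
        self.messages.extend(newer)
        if newer:
            self.last_seen_ts = newer[-1].ts
        return len(newer)
```

After a reconnect, the client calls `sync` once and gets only what it missed. A second call with the same history adds nothing.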

The logical clock (vector clock) preserves the ordering of chat messages.

The logical clock finds the causal relationships between events in a distributed system. It does so by attaching a counter that gets incremented on every chat message.
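The counter described here can be sketched as a Lamport-style clock, which is the simplest kind of logical clock. A vector clock extends the same idea by keeping one counter per process. The code below is an illustrative sketch with hypothetical names, not Slack's code.

```python
class LamportClock:
    """A single counter that orders events across processes."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Advance the clock for a local event such as sending a message."""
        self.time += 1
        return self.time

    def observe(self, remote_time: int) -> int:
        """Merge a timestamp received from another process, then advance."""
        self.time = max(self.time, remote_time) + 1
        return self.time
```

If one process sends a message stamped with `tick()` and the receiver calls `observe()` on that stamp, the receiver's next message always carries a larger number. So a reply sorts after the message it answers, even when the wall clocks of the machines disagree.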

The Thrift data serialization format also gave them high performance.

Slack App Workflow

Figure: Messaging Workflow

I'll summarize the Slack messaging workflow. The client delivers a message to the chat server. The chat server then requests the job queue to index the message in search. The message then gets routed to the gateway server using consistent hashing.

The gateway server also keeps an on-disk buffer of uncommitted sends. This helps it recover from crashes and keeps core Slack operational.
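One way to picture such a buffer is an append-only log that gets replayed after a crash. The Python sketch below is an assumption about the general technique. The `SendBuffer` class, the JSON-lines file format, and the `send`/`commit` record names are all hypothetical.

```python
import json
import os


class SendBuffer:
    """Append-only on-disk log of sends that have not been committed yet."""

    def __init__(self, path: str) -> None:
        self.path = path

    def _append(self, record: dict) -> None:
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())  # make sure the record survives a crash

    def record_send(self, message_id: str, payload: str) -> None:
        self._append({"op": "send", "id": message_id, "payload": payload})

    def record_commit(self, message_id: str) -> None:
        self._append({"op": "commit", "id": message_id})

    def uncommitted(self) -> dict[str, str]:
        """Replay the log and return sends that never got a commit."""
        pending: dict[str, str] = {}
        if not os.path.exists(self.path):
            return pending
        with open(self.path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                if record["op"] == "send":
                    pending[record["id"]] = record["payload"]
                else:
                    pending.pop(record["id"], None)
        return pending
```

After a restart, the server opens the same file and calls `uncommitted()` to find the sends it still has to retry.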


Slack grew to support billions of messages a day and runs 5 million simultaneous sessions at peak.

My takeaways from this case study are:

  • Optimality changes with growth and it’s important to find the end-to-end part of the problem

  • Complexity isn’t bad if it solves a problem


  • https://gotoams.nl/2018/sessions/440/scaling-slack

  • https://systemdesign.one/slack-architecture/

  • https://slack.engineering/real-time-messaging/

  • Photo by Scott Webb on Unsplash
