WeChat Architecture That Powers 1.67 Billion Monthly Users
#17: Read Now - Architecture Overview Chat Application (4 minutes)
WeChat is the most popular messaging app in China with 1.67 billion monthly active users.
It grew from 491 concurrent users in 2011 to more than 400 million concurrent users.
Their development principle was simple: release first and then optimize. And it allowed them to validate the product idea before scaling it.
Chat Application Architecture
When a message gets sent, the WeChat server stores it because the receiver might be offline. And a notification gets delivered to the receiver once they are back online.
The receiver then fetches the message from the server.
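Here's a minimal sketch of that store, notify, and fetch flow. It's an in-memory toy, not WeChat's code: the `MessageServer` class and its method names are my own assumptions for illustration.

```python
from collections import defaultdict, deque


class MessageServer:
    """Hypothetical in-memory stand-in for the server-side message store."""

    def __init__(self):
        self.inbox = defaultdict(deque)  # receiver -> pending messages
        self.online = set()              # receivers currently connected
        self.notifications = []          # (receiver, pending_count) pushes sent

    def send(self, sender, receiver, text):
        # Always store first, because the receiver might be offline.
        self.inbox[receiver].append((sender, text))
        if receiver in self.online:
            self._notify(receiver)

    def connect(self, receiver):
        # Coming back online triggers a notification if anything is pending.
        self.online.add(receiver)
        if self.inbox[receiver]:
            self._notify(receiver)

    def _notify(self, receiver):
        self.notifications.append((receiver, len(self.inbox[receiver])))

    def fetch(self, receiver):
        # The receiver pulls its pending messages after being notified.
        messages = list(self.inbox[receiver])
        self.inbox[receiver].clear()
        return messages
```

The notification only says that something is waiting. The message itself travels in the fetch that follows.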
They used an asynchronous message queue to deliver chat messages to the receiver because processing takes longer in some use cases.
For example, message delivery in group chat is a time-consuming operation.
They also restricted the number of people in a group chat to 500. And tracked each person's last-read message index so that only unread messages get delivered.
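The queue and the last-read index fit together roughly like this. It's a sketch under my own assumptions: `GroupChat`, `delivery_worker`, and the use of Python's `queue.Queue` are stand-ins for whatever WeChat really runs.

```python
import queue
import threading

MAX_GROUP_SIZE = 500  # the group limit mentioned above


class GroupChat:
    def __init__(self, members):
        if len(members) > MAX_GROUP_SIZE:
            raise ValueError("a group chat is limited to 500 people")
        self.messages = []                         # append-only group log
        self.last_read = {m: 0 for m in members}   # member -> messages already read

    def unread(self, member):
        # Return everything after the member's last-read index, then advance it.
        start = self.last_read[member]
        batch = self.messages[start:]
        self.last_read[member] = len(self.messages)
        return batch


def delivery_worker(jobs):
    # Runs off the request path, so a slow group delivery never blocks the sender.
    while True:
        job = jobs.get()
        if job is None:  # sentinel: stop the worker
            break
        group, text = job
        group.messages.append(text)
        jobs.task_done()
```

The sender only enqueues `(group, text)` and returns. Each member later calls `unread()` and gets just the messages past their own index, so the server stores one copy of the message instead of up to 500.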
WeChat servers are written in C++. And services communicate with each other via Svrkit, an RPC framework.
They built WeChat with N-layer Architecture: Access, Logic, and Storage.
N-layer architecture partitions application logic into different logical layers and offers scalability.
The access layer handled network calls: client-initiated requests and server-initiated pushes.
I don’t know the network protocol that WeChat uses but WebSocket is a good choice. Because it offers bi-directional communication between the client and the server.
The logic layer contains the business logic and provides an abstract interface to the client.
They broke the business logic into separate modules based on functionality and importance. And kept the modules independently deployable.
The storage layer provided data access and included two databases: MySQL and SDB.
SDB is a simple string key-value database.
They also used Memcached to speed up data access and reduce database calls.
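A common way to put a cache in front of a database is the cache-aside pattern. I don't know which pattern WeChat used, so treat this as an assumption. The two dicts below stand in for Memcached and the database, and `CachedStore` is a made-up name.

```python
class CachedStore:
    """Cache-aside sketch: plain dicts stand in for Memcached and the database."""

    def __init__(self, database):
        self.database = database  # stand-in for MySQL or SDB
        self.cache = {}           # stand-in for Memcached
        self.db_reads = 0         # counts how often the database gets hit

    def get(self, key):
        if key in self.cache:
            return self.cache[key]      # cache hit: no database call
        self.db_reads += 1
        value = self.database.get(key)  # cache miss: read from the database
        if value is not None:
            self.cache[key] = value
        return value

    def put(self, key, value):
        self.database[key] = value
        self.cache.pop(key, None)  # invalidate so the next read reloads the value
```

Repeated reads of the same key hit the database once. A write drops the cached copy instead of updating it, which keeps the cache from serving a stale value.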
They stored the last-read message index and contacts of a person in SDB because it's highly performant and reliable.
And they ran SDB in a leader-follower topology with asynchronous replication. SDB followers kept serving reads if the leader crashed.
They stored account data and chat messages in MySQL. And ran MySQL in a multi-leader topology to scale writes.
They routed reads to MySQL followers. But reads that needed strong consistency got routed to the MySQL leader.
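The read routing rule is small enough to sketch. `ReadRouter` and the round-robin choice of follower are my assumptions. The article only says which kind of node serves which kind of read.

```python
import itertools


class ReadRouter:
    """Hypothetical router: strong reads go to the leader, the rest to followers."""

    def __init__(self, leader, followers):
        self.leader = leader
        # Fall back to the leader if there are no followers at all.
        self._followers = itertools.cycle(followers or [leader])

    def pick(self, strong=False):
        if strong:
            return self.leader        # followers may lag behind the leader
        return next(self._followers)  # spread ordinary reads round-robin
```

For example, `ReadRouter("leader", ["f1", "f2"]).pick(strong=True)` returns `"leader"`, while plain `pick()` calls alternate between the two followers.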
Yet replication lag and a single point of failure remained a problem with this setup.
So they built KVSvr on top of MySQL and SDB.
KVSvr is a distributed algorithm based on the Quorum protocol. It offers a strong consistency guarantee, asynchronous data replication, and high write performance. It also caches data for better read performance.
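I couldn't find KVSvr's internals, so here's the textbook quorum idea instead, not KVSvr itself. With N replicas, a write needs W acknowledgements and a read asks R replicas. If W + R > N, every read overlaps with the latest write. The `QuorumStore` class, the `reachable` parameter, and the single version counter are all simplifying assumptions.

```python
class QuorumStore:
    """Generic quorum read/write sketch with a single writer."""

    def __init__(self, n=3, w=2, r=2):
        if w + r <= n:
            raise ValueError("read and write quorums must overlap: need w + r > n")
        self.replicas = [{} for _ in range(n)]  # each replica: key -> (version, value)
        self.w, self.r = w, r
        self.version = 0  # one writer here, so a simple counter is enough

    def write(self, key, value, reachable):
        # reachable: indices of the replicas that acknowledge this write
        if len(reachable) < self.w:
            raise RuntimeError("write quorum not reached")
        self.version += 1
        for i in reachable:
            self.replicas[i][key] = (self.version, value)

    def read(self, key, reachable):
        # reachable: indices of the replicas that answer this read
        if len(reachable) < self.r:
            raise RuntimeError("read quorum not reached")
        answers = [self.replicas[i][key] for i in reachable if key in self.replicas[i]]
        if not answers:
            return None
        # The copy with the highest version is the latest write.
        return max(answers, key=lambda answer: answer[0])[1]
```

With 3 replicas and W = R = 2, a write that reaches only replicas 0 and 1 is still visible to a read from replicas 1 and 2, because the two sets share replica 1. The store also survives the loss of any single replica, which is what removes the single point of failure.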
A person’s messages, contacts, and account data get stored on the server. And the client needs to synchronize it.
So they created a snapshot of the data on the server and sent it to the client. The snapshot consisted of key-value pairs.
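One way to picture the snapshot sync is below. The version numbers and the "send only what changed since the last sync" part are my own assumptions. The article only says that a snapshot of key-value pairs goes from the server to the client. `SyncServer` and `SyncClient` are hypothetical names.

```python
class SyncServer:
    def __init__(self):
        self.data = {}    # key -> (version, value)
        self.version = 0  # bumped on every change

    def put(self, key, value):
        self.version += 1
        self.data[key] = (self.version, value)

    def snapshot(self, since=0):
        # Key-value pairs that changed after the version the client last saw.
        changed = {k: v for k, (ver, v) in self.data.items() if ver > since}
        return self.version, changed


class SyncClient:
    def __init__(self):
        self.data = {}
        self.version = 0  # last server version this client has applied

    def sync(self, server):
        self.version, changed = server.snapshot(self.version)
        self.data.update(changed)
        return changed
```

The first `sync()` pulls everything. Later calls pull only the keys written since the previous sync, which keeps the payload small on mobile networks.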
Multi Data Center
Their initial data center was in Shanghai and they wanted to grow further. So they added data centers in Hong Kong and Canada.
They wanted each data center to be self-sufficient. So they deployed every service in each data center. It allowed them to route traffic to a healthy data center if one of them failed.
Yet data consistency between data centers remained a problem. Because the latency between data centers was high.
Also they needed to avoid business logic problems due to eventual consistency.
So they segmented users. Traffic from China got routed to the Shanghai data center, while international users got routed to data centers outside China.
And data got replicated asynchronously between data centers. This setup reduced complexity and prevented consistency problems.
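The routing rule with failover might look like this. The data center names come from the article. The `pick_data_center` function, the country-code input, and the fallback order are assumptions on my part.

```python
DATA_CENTERS = ["shanghai", "hong_kong", "canada"]


def pick_data_center(country_code, healthy):
    """Return the data center for a user, skipping unhealthy ones."""
    # Users in China belong to Shanghai; everyone else belongs outside China.
    preferred = ["shanghai"] if country_code == "CN" else ["hong_kong", "canada"]
    fallback = [dc for dc in DATA_CENTERS if dc not in preferred]
    for dc in preferred + fallback:
        if dc in healthy:
            return dc
    raise RuntimeError("no healthy data center")
```

A user in China lands on Shanghai while it's healthy. If Shanghai is down, the same call falls through to the next healthy data center, which works only because every service runs in each data center.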
They used a Quorum-based queue to synchronize data between data centers for reliability.
And coordinated operations across data centers for special cases such as global unique account ID creation.
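I don't know how WeChat creates account IDs. One common approach, shown here purely as an assumption, is a single coordinator that leases disjoint ID blocks to each data center. The only cross-data-center call is the lease itself. `IdAllocator` is a hypothetical name.

```python
import threading


class IdAllocator:
    """Hypothetical global coordinator that leases disjoint ID ranges."""

    def __init__(self, block_size=1000):
        self._next = 1
        self._block_size = block_size
        self._lock = threading.Lock()

    def lease(self):
        # The lock makes sure no two data centers ever get the same block.
        with self._lock:
            start = self._next
            self._next += self._block_size
        return range(start, start + self._block_size)
```

Each data center hands out IDs from its own block locally and asks for a new block only when the current one runs out.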
They decided to use the eventual consistency model for group chat because it met their needs.
WeChat was extended to support voice messages, games, and mobile payment. And it has become China's app for everything.
There are still many open questions about WeChat architecture. But I couldn't find extra information about it. So please share in the comments if you find anything helpful.