Everything You Need to Know About Gossip Protocol
#25: Learn More - How Distributed Systems Gossip (6 minutes)
Get the powerful template to approach system design for FREE on newsletter sign-up:
The 2 main problems in distributed systems are state management and communication.
A peer-to-peer service like gossip protocol can be used to solve them.
The gossip protocol handles system state with high availability.
Also application-level data can be piggybacked in gossip messages as key-value pairs.
The gossip protocol is also called the epidemic protocol. Because the messages get transferred like how epidemics spread.
Break into tech in 6 weeks 💻 (Featured)
Learn tech skills in 6 weeks to create a portfolio and land a remote job.
Why Use Gossip Protocol?
There are different ways to broadcast a message in distributed systems. They are:
1. Point-To-Point Broadcast
The producer sends the messages directly to the consumer. Also the producer retries if the consumer fails to accept the message.
Yet the message will be lost if both the producer and consumer fail simultaneously.
2. Eager Reliable Broadcast
Each server broadcasts a message to every other server in the system.
Although it’s fault-tolerant, this approach is problematic because:
High bandwidth usage due to O(n^2) messages broadcast to n number of servers
Network bottleneck due to O(n) linear broadcast
Extra storage is needed to maintain the list of nodes
3. Gossip Protocol
Each server periodically sends the messages to a set of random servers. And the entire system will receive a message eventually.
Gossip protocol is a good choice for communication in a large-scale system because:
Each server transfers only a limited number of messages
Limited bandwidth usage
Tolerant to network and server failures
The gossip protocol is reliable because many servers retransmit the messages.
Yet gossip protocol can be used to keep nodes consistent only if:
Operations are commutative
Serializability is not needed
The number of servers that receive a message from a particular server is called the Fanout.
The number of gossip rounds needed to transfer a specific message across the entire system is called the Cycle.
A case study of a gossiping system with 128 servers needed less than 2 percent of CPU and 60 KBps of bandwidth.
Here are some gossip protocol simulations:
Gossip Protocol Properties
There is no formal definition for gossip protocol. But it’s expected to have certain properties:
A peer server must be selected randomly
Each server stores only location information. And is unaware of the entire system state
Interactions between servers are periodic and pairwise
Gossip Algorithm
Each server maintains a list of servers and their metadata.
Here is how the gossip algorithm works:
Gossip periodically to a random server
The server inspects the received gossip message
The server merges the message with the highest version to local data
Gossip Protocol Implementation
The gossip protocol uses the peer sampling service to find the peer servers.
Here is how the peer sampling service works:
Initialize each server with a partial view of the system
Merge the server’s view with a peer server’s view on gossip exchange
A server initiating a gossip exchange sends a gossip digest synchronization message. It contains a list of gossip digests.
Here is a sample schema of gossip digest:
EndPointState:
10.0.1.41
HeartBeatState:
generation: 1259904231, version: 681
ApplicationState:
"average-load": 2.7, generation: 3659909691, version: 42
ApplicationState:
"bootstrapping": pxLpassF9XD8Kymj, generation: 1281909615, version: 91
The gossip messages get sent over User Datagram Protocol (UDP) or Transmission Control Protocol (TCP).
A server is considered healthy if the heartbeat counter keeps incrementing. So the heartbeat counter of a server is incremented on each gossip exchange.
The gossip protocol removes the data from a server using a tombstone. The tombstone is a special data entry to invalidate a data key without the actual deletion of the data.
Gossip Protocol Types
There are 3 types of gossip protocols. They are categorized based on message transfer time and the network traffic created.
1. Anti-Entropy Gossip Protocol
This variant sends an unbounded number of messages without termination.
It’s usually used to reduce the entropy between replicas of a stateful service like the database. The server with the newest message sends it to other servers.
Yet it causes high bandwidth usage due to the transfer of the entire dataset.
2. Rumor-Mongering Gossip Protocol
This variant’s cycle is more frequent compared to the anti-entropy cycle. So it will likely flood the network.
Yet rumor-mongering protocol uses less bandwidth because only the latest changes get transferred.
3. Aggregation Gossip Protocol
This variant creates a system-wide value by sampling data across each server. And then combining them.
How Gossip Protocol Spreads Messages
There are different ways to spread gossip messages. So it should be chosen based on the service needs and available network conditions.
The 3 ways to spread gossip messages are:
1. Push Model
The server with the newest message sends it to a random set of servers.
The push model is efficient if there are only a few messages because it avoids the traffic overhead.
2. Pull Model
Each server actively polls a random set of servers for newer messages.
The pull model is efficient if there are many new messages.
3. Push-Pull Model
The server pushes the newest messages and also polls for newer messages.
The push model is efficient in the initialization phase when there are only a few active servers.
While the pull model becomes efficient if there are many active servers.
Gossip Protocol Use Cases
The gossip protocol can be used to implement:
First-in-first-out (FIFO) broadcast
Causality broadcast
Total order broadcast
Some of the popular use cases of gossip protocol are:
Spreading server state across the system in Amazon S3
Detecting failures and tracking server membership in Amazon Dynamo
Propagating server metadata in the Redis cluster
Spreading the nonce value across the mining servers in Bitcoin
Electing the leader and detecting agent failures in Consul
Transferring consistent hash ring state in Riak database
Gossip Protocol Advantages
The advantages of gossip protocol are:
Fault-tolerant: messages flow via many routes making it tolerant against unreliable networks
Scalable: each server interacts with a limited number of servers and sends a fixed number of messages. Also a server doesn’t wait for an acknowledgement
Decentralized: it uses a peer-to-peer communication model
Gossip Protocol Disadvantages
The disadvantages of gossip protocol are:
Eventually consistent: it’s slower compared to multicast. Also gossip behavior depends on network topology
Difficult to debug and test: non-deterministic and distributed nature makes it hard to debug and reproduce failures
High bandwidth usage: A single message will get sent many times across different servers
The takeaway is to avoid gossiping in the real world. But gossip in the world of distributed systems when eventual consistency is fine.
Consider subscribing to get simplified case studies delivered straight to your inbox:
Thank you for supporting this newsletter. Consider sharing this post with your friends and get rewards. Y’all are the best.
References
https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/2007PromiseAndLimitations.pdf
https://systemdesign.one/gossip-protocol/