Everything You Need to Know About Gossip Protocol

#25: Learn More - How Distributed Systems Gossip (6 minutes)

Neo Kim

Nov 28, 2023

The 2 main problems in distributed systems are state management and communication.

A peer-to-peer service like gossip protocol can be used to solve them.

The gossip protocol handles system state with high availability.

Also, application-level data can be piggybacked in gossip messages as key-value pairs.

The gossip protocol is also called the epidemic protocol because messages spread the way epidemics do.

Why Use Gossip Protocol?

There are different ways to broadcast a message in distributed systems. They are:

1. Point-To-Point Broadcast

The producer sends the messages directly to the consumer. Also the producer retries if the consumer fails to accept the message.

Yet the message will be lost if both the producer and consumer fail simultaneously.

2. Eager Reliable Broadcast

Each server broadcasts a message to every other server in the system.

Although it’s fault-tolerant, this approach is problematic because:

High bandwidth usage due to O(n^2) messages sent across n servers
Network bottleneck due to each sender's O(n) linear broadcast
Extra storage is needed to maintain the list of servers
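
The O(n^2) figure can be checked with a small simulation. This is only an illustrative sketch under simplifying assumptions: a single message, no failures, and a hypothetical function name `eager_reliable_broadcast`. Every server forwards the message to all other servers the first time it sees it.

```python
from collections import deque


def eager_reliable_broadcast(n: int, origin: int = 0) -> int:
    """Count the messages sent when every server re-broadcasts a message
    to all other servers the first time it receives it."""
    delivered = {origin}
    queue = deque([origin])
    sent = 0
    while queue:
        sender = queue.popleft()
        for receiver in range(n):
            if receiver == sender:
                continue
            sent += 1  # one message per (sender, receiver) pair
            if receiver not in delivered:
                delivered.add(receiver)
                queue.append(receiver)
    return sent


for n in (4, 16, 128):
    print(n, eager_reliable_broadcast(n), n * (n - 1))
```

Each of the n servers sends n - 1 messages, so the total is n * (n - 1), which grows quadratically.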

3. Gossip Protocol

Each server periodically sends messages to a random set of servers, so the entire system eventually receives every message.

Gossip protocol is a good choice for communication in a large-scale system because:

Each server transfers only a limited number of messages
Limited bandwidth usage
Tolerant to network and server failures

The gossip protocol is reliable because many servers retransmit the messages.

Yet gossip protocol can be used to keep nodes consistent only if:

Operations are commutative
Serializability is not needed
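
To see why commutativity matters, here is a minimal sketch. It assumes each update is a map of per-key counters merged by taking the maximum. The `merge_max` helper and the sample updates are hypothetical. Because gossip delivers updates in no particular order, the merge must give the same result for every order.

```python
from itertools import permutations


def merge_max(state: dict[str, int], update: dict[str, int]) -> dict[str, int]:
    """Merge counters key by key, keeping the larger value.
    max() is commutative, so the arrival order does not matter."""
    merged = dict(state)
    for key, value in update.items():
        merged[key] = max(merged.get(key, 0), value)
    return merged


updates = [{"a": 3}, {"a": 1, "b": 2}, {"b": 5}]

# Apply the same updates in every possible order and collect the outcomes.
results = set()
for order in permutations(updates):
    state: dict[str, int] = {}
    for update in order:
        state = merge_max(state, update)
    results.add(tuple(sorted(state.items())))

print(results)  # {(('a', 3), ('b', 5))}
```

A non-commutative operation such as "append to a list" would produce a different result for each order, and the servers would not converge.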

The number of servers that receive a message from a particular server is called the Fanout.

The number of gossip rounds needed to transfer a specific message across the entire system is called the Cycle.
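
The relationship between fanout and cycle can be explored with a rough simulation. This is a sketch, not a benchmark: it assumes push-style gossip, no failures, and peers picked uniformly at random (a server may pick itself or an already informed server). The function name `gossip_cycles` is hypothetical.

```python
import random


def gossip_cycles(n: int, fanout: int, seed: int = 0) -> int:
    """Return how many cycles it takes until all n servers hold the message.
    Requires fanout <= n."""
    rng = random.Random(seed)
    informed = {0}  # server 0 starts with the message
    cycles = 0
    while len(informed) < n:
        reached = set()
        for _ in informed:
            # Every informed server gossips to `fanout` random servers.
            reached.update(rng.sample(range(n), fanout))
        informed |= reached
        cycles += 1
    return cycles


for fanout in (1, 2, 3, 5):
    print("fanout", fanout, "cycles", gossip_cycles(128, fanout))
```

The informed set can grow by at most a factor of (fanout + 1) per cycle, so the cycle count is at least logarithmic in the number of servers. A larger fanout trades more messages per cycle for fewer cycles.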

In one case study, a gossiping system with 128 servers needed less than 2 percent of CPU and 60 KBps of bandwidth.

Gossip Protocol Properties

There is no formal definition for gossip protocol. But it’s expected to have certain properties:

A peer server must be selected randomly
Each server stores only local information and is unaware of the entire system state
Interactions between servers are periodic and pairwise

Gossip Algorithm

Each server maintains a list of servers and their metadata.

Here is how the gossip algorithm works:

Each server periodically gossips to a random server
The receiving server inspects the gossip message
The receiving server merges the entries with the highest version into its local data
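
The steps above can be sketched as follows. This is a minimal illustration, assuming each server keeps a dictionary of versioned entries. The names `Entry`, `merge_highest_version`, and `gossip_round`, and the sample data, are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    value: str
    version: int


def merge_highest_version(local: dict[str, Entry], received: dict[str, Entry]) -> None:
    """Keep, for every key, whichever entry carries the highest version."""
    for key, entry in received.items():
        current = local.get(key)
        if current is None or entry.version > current.version:
            local[key] = entry


def gossip_round(states: dict[str, dict[str, Entry]], rng: random.Random) -> None:
    """Every server sends its data to one randomly chosen peer."""
    for sender in states:
        peer = rng.choice([name for name in states if name != sender])
        merge_highest_version(states[peer], states[sender])


states = {
    "s1": {"load": Entry("2.7", version=42)},
    "s2": {"load": Entry("1.9", version=40)},
    "s3": {},
}
rng = random.Random(7)
for _ in range(10):
    gossip_round(states, rng)
print(states)
```

A real implementation would run each round on a timer and exchange data over the network. Here the exchange is a plain function call to keep the sketch self-contained.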

Gossip Protocol Implementation

The gossip protocol uses the peer sampling service to find the peer servers.

Here is how the peer sampling service works:

Initialize each server with a partial view of the system
Merge the server’s view with a peer server’s view on gossip exchange
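
A view merge might look like the sketch below. It assumes a view is a set of peer addresses capped at a fixed size and trimmed at random. The `merge_views` helper, the `max_size` parameter, and all addresses other than 10.0.1.41 are hypothetical.

```python
import random


def merge_views(own: set[str], peer_view: set[str], self_addr: str,
                peer_addr: str, max_size: int, rng: random.Random) -> set[str]:
    """Merge a peer's partial view into our own, then trim back to max_size."""
    candidates = (own | peer_view | {peer_addr}) - {self_addr}
    if len(candidates) <= max_size:
        return candidates
    # Sort first so the sample is reproducible for a given seed.
    return set(rng.sample(sorted(candidates), max_size))


rng = random.Random(1)
view_a = {"10.0.1.42", "10.0.1.43"}  # partial view held by 10.0.1.41
view_b = {"10.0.1.44", "10.0.1.45"}  # partial view held by 10.0.1.42

new_a = merge_views(view_a, view_b, "10.0.1.41", "10.0.1.42", max_size=3, rng=rng)
new_b = merge_views(view_b, view_a, "10.0.1.42", "10.0.1.41", max_size=3, rng=rng)
print(new_a)
print(new_b)
```

Capping the view keeps memory use constant per server, while the random trimming keeps the views changing so peer selection stays close to random.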

A server initiating a gossip exchange sends a gossip digest synchronization message. It contains a list of gossip digests.

Here is a sample schema of a gossip digest:

EndPointState:
10.0.1.41

HeartBeatState: 
generation: 1259904231, version: 681

ApplicationState: 
"average-load": 2.7, generation: 3659909691, version: 42

ApplicationState: 
"bootstrapping": pxLpassF9XD8Kymj, generation: 1281909615, version: 91

The gossip messages get sent over User Datagram Protocol (UDP) or Transmission Control Protocol (TCP).

Each server increments its heartbeat counter on each gossip exchange. A server is considered healthy as long as its heartbeat counter keeps incrementing.
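
A heartbeat check could be sketched as below. The assumptions are loud ones: the generation changes when a server restarts, the version increases within a generation, and a server is suspected once its heartbeat has not advanced for a fixed timeout. The `FailureDetector` class and the 10-second timeout are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatRecord:
    generation: int
    version: int
    last_change: float  # local clock time when the heartbeat last advanced


class FailureDetector:
    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.records: dict[str, HeartbeatRecord] = {}

    def observe(self, endpoint: str, generation: int, version: int, now: float) -> None:
        """Record a heartbeat only if it is newer than the one already stored."""
        record = self.records.get(endpoint)
        if record is None or (generation, version) > (record.generation, record.version):
            self.records[endpoint] = HeartbeatRecord(generation, version, now)

    def is_healthy(self, endpoint: str, now: float) -> bool:
        record = self.records.get(endpoint)
        return record is not None and now - record.last_change <= self.timeout_seconds


detector = FailureDetector(timeout_seconds=10.0)
detector.observe("10.0.1.41", generation=1259904231, version=681, now=0.0)
detector.observe("10.0.1.41", generation=1259904231, version=682, now=5.0)
detector.observe("10.0.1.41", generation=1259904231, version=682, now=12.0)  # no progress
print(detector.is_healthy("10.0.1.41", now=14.0))  # True: 9 s since the last increment
print(detector.is_healthy("10.0.1.41", now=16.0))  # False: 11 s without an increment
```

Passing `now` explicitly keeps the sketch deterministic. A real detector would read a monotonic clock instead.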

The gossip protocol removes the data from a server using a tombstone. The tombstone is a special data entry to invalidate a data key without the actual deletion of the data.
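
The sketch below shows why a tombstone is needed. It assumes versioned entries merged by highest version. The `Versioned` type and the `delete`, `read`, and `merge` helpers are hypothetical. If the key were simply removed, a peer holding the old value would gossip it back.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Versioned:
    value: str
    version: int
    tombstone: bool = False


Store = dict[str, Versioned]


def delete(store: Store, key: str) -> None:
    """Invalidate the key with a newer tombstone entry; the data itself stays."""
    current = store[key]
    store[key] = replace(current, version=current.version + 1, tombstone=True)


def read(store: Store, key: str) -> Optional[str]:
    entry = store.get(key)
    if entry is None or entry.tombstone:
        return None
    return entry.value


def merge(local: Store, received: Store) -> None:
    """Highest version wins, so the tombstone beats the stale value."""
    for key, entry in received.items():
        if key not in local or entry.version > local[key].version:
            local[key] = entry


a: Store = {"session": Versioned("abc", version=1)}
b: Store = {"session": Versioned("abc", version=1)}
delete(a, "session")
merge(a, b)  # b's stale copy cannot resurrect the key
merge(b, a)  # the tombstone spreads to b
print(read(a, "session"), read(b, "session"))  # None None
```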

Gossip Protocol Types

There are 3 types of gossip protocols. They are categorized based on message transfer time and the network traffic created.

1. Anti-Entropy Gossip Protocol

This variant sends an unbounded number of messages without termination.

It’s usually used to reduce the entropy between replicas of a stateful service like a database. The server with the newest message sends it to other servers.

Yet it causes high bandwidth usage due to the transfer of the entire dataset.

2. Rumor-Mongering Gossip Protocol

This variant runs its cycles more frequently than the anti-entropy variant. So it will likely flood the network.

Yet rumor-mongering protocol uses less bandwidth because only the latest changes get transferred.
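
The difference in payload between the two variants can be illustrated with a rough sketch. It assumes every entry carries a version taken from a single increasing counter, and that the sender remembers the last version it already sent. Both function names and the `last_sent_version` parameter are hypothetical.

```python
Store = dict[str, tuple[str, int]]  # key -> (value, version)


def anti_entropy_payload(store: Store) -> Store:
    """Anti-entropy: send the entire dataset on every exchange."""
    return dict(store)


def rumor_payload(store: Store, last_sent_version: int) -> Store:
    """Rumor-mongering: send only the entries changed since the last exchange."""
    return {
        key: (value, version)
        for key, (value, version) in store.items()
        if version > last_sent_version
    }


store: Store = {"a": ("1", 10), "b": ("2", 11), "c": ("3", 12)}
print(len(anti_entropy_payload(store)))            # 3
print(rumor_payload(store, last_sent_version=11))  # {'c': ('3', 12)}
```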

3. Aggregation Gossip Protocol

This variant creates a system-wide value by sampling data from each server and then combining the samples.
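
One common way to combine the samples is pairwise averaging, sketched below as an assumption rather than as the method any specific system uses. Each server repeatedly averages its estimate with a random peer's estimate. The function name `gossip_average` and the sample loads are hypothetical.

```python
import random


def gossip_average(values: list[float], rounds: int, seed: int = 0) -> list[float]:
    """Estimate the system-wide average on every server by pairwise averaging.
    Needs at least two values."""
    rng = random.Random(seed)
    estimates = list(values)
    n = len(estimates)
    for _ in range(rounds):
        for i in range(n):
            # Pick a random peer j different from i.
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            # Both servers adopt the mean; the total sum stays the same.
            mean = (estimates[i] + estimates[j]) / 2
            estimates[i] = estimates[j] = mean
    return estimates


loads = [2.7, 1.3, 4.0, 0.8]  # per-server "average-load" samples
print(gossip_average(loads, rounds=50))
print(sum(loads) / len(loads))
```

Each exchange preserves the sum of all estimates, so the only value every server can agree on is the true average. No server ever needs to see the full list.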

How Gossip Protocol Spreads Messages

There are different ways to spread gossip messages. So choose the model based on the service needs and available network conditions.

The 3 ways to spread gossip messages are:

1. Push Model

The server with the newest message sends it to a random set of servers.

The push model is efficient if there are only a few messages because it avoids the traffic overhead.

2. Pull Model

Each server actively polls a random set of servers for newer messages.

The pull model is efficient if there are many new messages.

3. Push-Pull Model

The server pushes the newest messages and also polls for newer messages.

The push model is efficient in the initialization phase when there are only a few active servers, while the pull model becomes efficient once there are many active servers.
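
The three models can be compared with a small simulation. This is a sketch under simplifying assumptions: one message, one random peer per server per round, no failures, and a server may pick itself. The function name `rounds_to_spread` is hypothetical.

```python
import random


def rounds_to_spread(n: int, mode: str, seed: int = 0) -> int:
    """Count the rounds until all n servers have the message.
    mode is "push", "pull", or "push-pull"."""
    rng = random.Random(seed)
    informed = [False] * n
    informed[0] = True
    rounds = 0
    while not all(informed):
        snapshot = list(informed)  # state at the start of the round
        for server in range(n):
            peer = rng.randrange(n)
            if mode in ("push", "push-pull") and snapshot[server]:
                informed[peer] = True    # server pushes the message to its peer
            if mode in ("pull", "push-pull") and snapshot[peer]:
                informed[server] = True  # server pulls the message from its peer
        rounds += 1
    return rounds


for mode in ("push", "pull", "push-pull"):
    print(mode, rounds_to_spread(64, mode))
```

Running it with different seeds and sizes gives a feel for how each model behaves early on, when few servers have the message, and late, when most do.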

Gossip Protocol Use Cases

The gossip protocol can be used to implement:

First-in-first-out (FIFO) broadcast
Causality broadcast
Total order broadcast

Some of the popular use cases of gossip protocol are:

Spreading server state across the system in Amazon S3
Detecting failures and tracking server membership in Amazon Dynamo
Propagating server metadata in the Redis cluster
Propagating transactions and blocks across the nodes in Bitcoin
Tracking cluster membership and detecting agent failures in Consul
Transferring consistent hash ring state in Riak database

Gossip Protocol Advantages

The advantages of gossip protocol are:

Fault-tolerant: messages flow via many routes making it tolerant against unreliable networks
Scalable: each server interacts with a limited number of servers and sends a fixed number of messages. Also a server doesn’t wait for an acknowledgement
Decentralized: it uses a peer-to-peer communication model

Gossip Protocol Disadvantages

The disadvantages of gossip protocol are:

Eventually consistent: it’s slower compared to multicast. Also gossip behavior depends on network topology
Difficult to debug and test: non-deterministic and distributed nature makes it hard to debug and reproduce failures
High bandwidth usage: A single message will get sent many times across different servers

The takeaway is to avoid gossiping in the real world. But gossip in the world of distributed systems when eventual consistency is fine.
