Discover more from System Design Newsletter
Tumblr Shares Database Migration Strategy With 60+ Billion Rows
#3: Read Now - Amazing Database Migration Technique (4 minutes)
Get the powerful template to approach system design for FREE on newsletter sign-up:
Database migration gone wrong could result in data loss, downtime, and compatibility issues.
This post outlines the database migration techniques at Tumblr. If you want to learn more, scroll to the bottom and find the references.
Consider sharing this post with somebody who wants to study system design.
Tumblr is a micro-blogging platform that allows users to publish short blog posts.
How to Migrate a MySQL Database?
The fact that Tumblr uses MySQL database to store its critical data is wild.
The MySQL database at Tumblr consumes 21 terabytes with 60+ billion relational rows. And there are over 200 database servers.
Spreading the database leaders across many data centers provides scalability and high availability. So, they decided to migrate the database leader between data centers.
A brute force approach to database migration.
The database leader ran in the remote data center. So, switch off the traffic toward the remote data center. Migrate the database leader from the remote to the local data center. And Route the traffic toward the local data center.
But this is a recipe for poor availability.
Tumblr wanted the user impact to be very minimal on database migration.
An optimal approach to database migration.
The command and query responsibility segregation (CQRS) pattern simplified their database migration. The CQRS is a pattern that separates read and write operations to a database.
The most common replication topology in MySQL is leader-follower. The database leader accepts read-write requests. The database followers accept read requests and replicate the leader.
The database leader ran in the remote data center. And they provisioned database followers in the local data center. The persistent connections routed the read-write requests to the database leader. The connection reuse over a large number of queries allowed to reduce latency.
But each application service had a different usage pattern. This resulted in frequent disconnections to the database leader. The connection establishment with the database was expensive. So, they ran into a latency problem.
As a workaround, they provisioned a database proxy (ProxySQL) in the local data center. The database proxy held persistent connections to the remote database leader.
The database proxy allowed database connection pooling. Database connection pooling is the method of holding database connections open for reuse by others. This approach prevented frequent disconnections and improved performance.
They wanted to migrate the database leader from remote data center A to B. The workflow of the database migration at Tumblr is as follows:
Each data center stored the metadata of database followers, the leader, and the database proxy
They migrated the database leader from remote data center A to B
An automation tool pointed all followers to the new leader
An automation tool pointed the database proxy to the new leader
The database followers accepted read requests during the database migration. This approach resulted in minimal user impact with seconds of read-only state. The write requests were either rejected or buffered for a few seconds.
How to further improve write availability during database migration?
The leader-leader replication topology in MySQL could improve write availability during database migration. But this approach introduces the risk of data conflicts. This might be the reason why Tumblr doesn’t use it.
Consider subscribing to get simplified case studies delivered straight to your inbox:
Thank you for supporting this newsletter. Consider sharing this post with your friends and get rewards. Y’all are the best.
Word-of-mouth referrals like yours help this community grow - Thank you.
You can share your testimonial by posting on Twitter or LinkedIn. Then send me the link by replying to this email. Or, you can reply to this email with an anonymous testimonial.
References
Tumblr Engineering Team. (2016, October 13). Juggling Databases Between Datacenters. Tumblr Engineering Blog.
Engineering Team. (2012, June 8). Jetpants: A Toolkit for Huge MySQL Topologies. Tumblr Engineering Blog.
Atchison, L. (2019, November 26). Avoiding Downtime When Migrating Data to the Cloud. LinkedIn.
Microsoft. CQRS pattern. Microsoft Azure Architecture Patterns.
What is database pooling? (n.d.). In Stack Overflow.
Subscribe to System Design Newsletter
A weekly newsletter to help you pass system design interview and become good at work