How Khan Academy Scaled to 30 Million Users
#39: Break Into Khan Academy Architecture (5 minutes)
This post outlines how Khan Academy scaled to 30 million users. If you want to learn more, scroll to the bottom and find the references.
August 2004 - Boston, United States.
Sal Khan worked for a hedge fund.
One day he received a phone call from a cousin.
His cousin needed help learning mathematics.
So he started tutoring her over the phone and Yahoo Doodle after work.
And seasons passed.
Sal started tutoring many more of his cousins.
But he faced problems with time management.
So he started recording videos and posted them on YouTube.
The rest is history.
Khan Academy Architecture
Here’s how Khan Academy achieved extreme scalability:
1. Keep It Simple, Stupid
They used YouTube to serve videos to students because it lowers costs and delivers good performance.
They also set up a fallback to serve videos from Amazon S3 via the Fastly CDN. It solved the problem of institutions where YouTube isn’t allowed.
In other words, S3 stores the videos while Fastly caches them for subsequent views.
Amazon Simple Storage Service (S3) is an object storage service. It stores unstructured data in a flat namespace without a hierarchy. A Content Delivery Network (CDN) is a globally distributed network of servers that caches content close to users for low latency.
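The fallback idea can be sketched in a few lines. This is only an illustration, not Khan Academy's code: the `Video` type, the `video_url` function, the `youtube_reachable` flag, and the `cdn.example.org` host are all assumptions made up for the example.

```python
from dataclasses import dataclass

# Hypothetical CDN host; the article does not give real endpoints.
YOUTUBE_EMBED_BASE = "https://www.youtube.com/embed/"
CDN_VIDEO_BASE = "https://cdn.example.org/videos/"


@dataclass(frozen=True)
class Video:
    youtube_id: str  # ID of the upload on YouTube
    object_key: str  # key of the same video file in the object store


def video_url(video: Video, youtube_reachable: bool) -> str:
    """Prefer YouTube; fall back to the CDN-fronted object store."""
    if youtube_reachable:
        return YOUTUBE_EMBED_BASE + video.youtube_id
    # The CDN caches the file after the first request, so later
    # viewers are served without hitting the object store again.
    return CDN_VIDEO_BASE + video.object_key
```

How the client decides that YouTube is unreachable (a probe request, a school-level setting) is left out here because the source doesn't describe it.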
Besides, they use a serverless architecture for scalability and run a content management system (CMS) with a versioned content store. The serverless application handles dynamic behavior like tracking the progress of students.
2. Essential Complexity Is Unavoidable
They started with a monolithic architecture for simplicity.
But they later moved to a microservices architecture and now run around 20 services.
Although microservices increase the system complexity, they offer the following benefits:
Faster deployments and test runs due to independent small services
Reduced blast radius on deployment failures
Easy to choose the instance type and hosting configuration
Easy to optimize the services for performance and cost
Each microservice is responsible for its own data, and a service can access another service's data only via a GraphQL API.
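Here is a minimal sketch of that ownership rule. It is an illustration under assumptions, not their implementation: `ContentService`, `ProgressService`, and the `query` method are hypothetical names, and `query` only loosely mimics a GraphQL field selection instead of using a real GraphQL library.

```python
class ContentService:
    """Owns exercise data; nothing outside this class touches _exercises."""

    def __init__(self) -> None:
        self._exercises = {
            "ex1": {"id": "ex1", "title": "Adding fractions", "difficulty": 2},
        }

    def query(self, exercise_id: str, fields: list[str]) -> dict:
        """Return only the requested fields, like a GraphQL selection set."""
        record = self._exercises[exercise_id]
        return {name: record[name] for name in fields}


class ProgressService:
    """Owns progress data; reads exercise data only through ContentService.query."""

    def __init__(self, content: ContentService) -> None:
        self._content = content
        self._completed: dict[str, list[str]] = {}

    def mark_completed(self, user_id: str, exercise_id: str) -> None:
        self._completed.setdefault(user_id, []).append(exercise_id)

    def completed_titles(self, user_id: str) -> list[str]:
        # Ask the owning service for the title instead of reading its store.
        return [
            self._content.query(ex_id, ["title"])["title"]
            for ex_id in self._completed.get(user_id, [])
        ]
```

Because `ProgressService` never reads the exercise store directly, the content team could change how exercises are stored without breaking it, as long as `query` keeps returning the same fields.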
3. Don’t Reinvent the Wheel
They didn't buy more servers or scale the operations team. Instead, they ran their infrastructure on Google Cloud and used serverless on Google App Engine for scalability.
So they could focus more on the application logic and worry less about scalability, which let them avoid reinventing the wheel.
They also used Fastly CDN for low latency and reduced server load.
And they used GraphQL for APIs because it's efficient and flexible: a client requests exactly the fields it needs.
4. Pick Your Battles Wisely
They used Python in the early days to stay productive.
But they wanted lower costs, better performance, and higher reliability.
So they moved to the Go programming language. It offers fast compile times, uses less memory, and has good support for concurrency.
Besides, they use Google Cloud Datastore as the database for performance and scalability. Google Cloud Datastore is a fully managed NoSQL database.
They use Google App Engine because it’s a fully managed environment that lets them scale easily without extra operational effort.
In other words, they offloaded the scaling part to focus more on creating educational content.
5. Don’t Repeat Yourself
They use Fastly CDN to cache static data for scalability and performance.
Every request also goes through Fastly CDN to prevent unwanted server traffic.
Besides, they cache common queries, user preferences, and session data using Memcache, which allowed them to reduce server costs.
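The usual way to apply this kind of caching is the cache-aside pattern: check the cache first and go to the database only on a miss. The sketch below is an assumption about how it could look, not their code. `TTLCache` is an in-memory stand-in for a Memcache client, and `get_user_preferences`, `load_from_db`, and the 300-second expiry are hypothetical.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """In-memory stand-in for a Memcache client (assumed get/set with expiry)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._items: dict[str, tuple[float, object]] = {}

    def get(self, key: str) -> Optional[object]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: object, ttl_seconds: float) -> None:
        self._items[key] = (self._clock() + ttl_seconds, value)


def get_user_preferences(user_id, cache, load_from_db, ttl_seconds=300.0):
    """Cache-aside: serve from the cache, fall back to the database on a miss."""
    key = f"prefs:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(user_id)  # the expensive call we want to avoid repeating
    cache.set(key, value, ttl_seconds)
    return value
```

Repeated reads for the same user within the expiry window never reach the database, which is where the cost savings come from.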
6. People Are the Most Important Asset
They embraced a common goal as a team - either succeed or fail together. And they didn't chase perfection in every change but did only what was enough.
They also avoided scope creep at all costs and didn't take on new work while in the middle of something.
Each engineer worked on different product areas to increase the sense of ownership.
Besides, they celebrated each milestone because humans thrive on a sense of accomplishment.
7. Obey Conway's Law
Conway's law is a software development principle. It says that the software structure will be similar to the communication structure of the organization.
They kept an open communication channel between team members for better decision-making.
They also used Architecture Decision Records (ADRs) to communicate design decisions because it’s easier to provide details about design choices and change them later if needed.
An ADR is a short document that records an architectural decision along with its context and consequences. Think of it like a decision log for the software's design.
Khan Academy is a nonprofit organization.
They served 30 million students in April 2020 with a simple tech stack and architecture.
And they became the most used online resource during distance learning in 2020. They offer quality content and a classroom experience to millions of students for free.
References