This article is a chapter from my new book Explain the Cloud Like I'm 10. The first release was written specifically for cloud newbies. I've made some updates and added a few chapters—Netflix: What Happens When You Press Play? and What is Cloud Computing?—that level it up to a couple ticks past beginner. I think even fairly experienced people might get something out of it.
I've also created a somewhat expanded version of the article in a standalone Kindle ebook. You can find the ebook at Netflix: What Happens When You Press Play?
So if you are looking for a good introduction to the cloud or know someone who is, please take a look. I think you'll like it. I'm pretty proud of how it turned out.
I pulled this chapter together from dozens of sources that were at times somewhat contradictory. Facts on the ground change over time and depend on who is telling the story and what audience they're addressing. I tried to create as coherent a narrative as I could. If there are any errors I'd be more than happy to fix them. Keep in mind this article is not a technical deep dive. It's a big picture type article. For example, I don't mention the word microservice even once :-)
Netflix seems so simple. Press play and video magically appears. Easy, right? Not so much.
Given our discussion in the What is Cloud Computing? chapter, you might expect Netflix to serve video using AWS. Press play in a Netflix application and video stored in S3 would be streamed from S3, over the internet, directly to your device.
A completely sensible approach…for a much smaller service.
But that’s not how Netflix works at all. It’s far more complicated and interesting than you might imagine.
To see why, let’s look at some impressive Netflix statistics for 2017.
- Netflix has more than 110 million subscribers.
- Netflix operates in more than 200 countries.
- Netflix has nearly $3 billion in revenue per quarter.
- Netflix adds more than 5 million new subscribers per quarter.
- Netflix plays more than 1 billion hours of video each week. As a comparison, YouTube streams 1 billion hours of video every day while Facebook streams 110 million hours of video every day.
- Netflix played 250 million hours of video on a single day in 2017.
- Netflix accounts for over 37% of peak internet traffic in the United States.
- Netflix plans to spend $7 billion on new content in 2018.
What have we learned?
Netflix is huge. They’re global, they have a lot of members, they play a lot of videos, and they have a lot of money.
Another relevant factoid is Netflix is subscription based. Members pay Netflix monthly and can cancel at any time. When you press play to chill on Netflix, it had better work. Unhappy members unsubscribe.
We’re Going Deep
Netflix is a terrific example of all the ideas we’ve talked about, which is why this chapter goes into a lot more detail than the other cloud services we’ve covered.
One big reason for diving deeper into Netflix is they make much more information available than other companies.
Netflix holds communication as a central cultural value. Netflix more than lives up to its standards.
In fact, I’d like to thank Netflix for being so open about their architecture. Over the years, Netflix has given hundreds of talks and written hundreds of articles on the inner-workings of how they operate. The whole industry is better for it.
Another reason for going into so much detail on Netflix is Netflix is just plain fascinating. Most of us have used Netflix at one time or another. Who wouldn’t love peeking behind the curtain to see what makes Netflix tick?
Netflix operates in two clouds: AWS and Open Connect.
How does Netflix keep their members happy? With the cloud of course. Actually, Netflix uses two different clouds: AWS and Open Connect.
Both clouds must work together seamlessly to deliver endless hours of customer-pleasing video.
The three parts of Netflix: client, backend, content delivery network (CDN).
You can think of Netflix as being divided into three parts: the client, the backend, and the content delivery network (CDN).
The client is the user interface on any device used to browse and play Netflix videos. It could be an app on your iPhone, a website on your desktop computer, or even an app on your Smart TV. Netflix controls each and every client for each and every device.
Everything that happens before you hit play happens in the backend, which runs in AWS. That includes things like preparing all new incoming video and handling requests from all apps, websites, TVs, and other devices.
Everything that happens after you hit play is handled by Open Connect. Open Connect is Netflix’s custom global content delivery network (CDN). Open Connect stores Netflix video in different locations throughout the world. When you press play the video streams from Open Connect, into your device, and is displayed by the client. Don’t worry; we’ll talk more about what a CDN is a little later.
Interestingly, at Netflix they don’t actually say hit play on a video; they say clicking start on a title. Every industry has its own lingo.
By controlling all three areas (client, backend, and CDN), Netflix has achieved complete vertical integration.
Netflix controls your video viewing experience from beginning to end. That’s why it just works when you click play from anywhere in the world. You reliably get the content you want to watch when you want to watch it.
Let’s see how Netflix makes that happen.
In 2008 Netflix Started Moving to AWS
Netflix launched in 1998. At first they rented DVDs through the US Postal Service. But Netflix saw the future was on-demand streaming video.
In 2007 Netflix introduced their streaming video-on-demand service that allowed subscribers to stream television series and films via the Netflix website on personal computers, or the Netflix software on a variety of supported platforms, including smartphones and tablets, digital media players, video game consoles, and smart TVs.
On a personal note, it might seem obvious that streaming video-on-demand was the future. And it was. I worked at a couple of startups that tried to make a video-on-demand product. They failed.
Netflix succeeded. Netflix certainly executed well, but they were late to the game, and that helped them. By 2007 the internet was fast enough and cheap enough to support streaming video services. That was never the case before. The addition of fast, low-cost mobile bandwidth and the introduction of powerful mobile devices like smartphones and tablets have made it easier and cheaper for anyone to stream video at any time from anywhere. Timing is everything.
Netflix Began by Running their Own Datacenters
EC2 was just getting started in 2007, about the same time Netflix’s streaming service started. There was no way Netflix could have launched using EC2.
Netflix built two datacenters, located right next to each other. They experienced all the problems we talked about in earlier chapters.
Building out a datacenter is a lot of work. Ordering equipment takes a long time. Installing and getting all the equipment working takes a long time. And as soon as they got everything working they would run out of capacity, and the whole process had to start over again.
The long lead times for equipment forced Netflix to adopt what is known as a vertical scaling strategy. Netflix made big programs that ran on big computers. This approach is called building a monolith. One program did everything.
The problem is that when you’re growing really fast, like Netflix was, it’s very hard to make a monolith reliable. And it wasn’t.
A Service Outage Caused Netflix to Move to AWS
For three days in August 2008, Netflix could not ship DVDs because of corruption in their database. This was unacceptable. Netflix had to do something.
The experience of building datacenters had taught Netflix an important lesson—they weren’t good at building datacenters.
What Netflix was good at was delivering video to their members. Netflix would rather concentrate on getting better at delivering video than on getting better at building datacenters. Building datacenters was not a competitive advantage for Netflix; delivering video is.
At that time, Netflix decided to move to AWS. AWS was just getting established, so selecting AWS was a bold move.
Netflix moved to AWS because it wanted a more reliable infrastructure. Netflix wanted to remove any single point of failure from its system. AWS offered highly reliable databases, storage and redundant datacenters. Netflix wanted cloud computing, so it wouldn’t have to build big unreliable monoliths anymore. Netflix wanted to become a global service without building its own datacenters. None of these capabilities were available in its old datacenters and never would be.
A reason Netflix gave for choosing AWS was it didn’t want to do any undifferentiated heavy lifting. Undifferentiated heavy lifting is the work that has to be done but doesn’t provide any advantage to the core business of providing a quality video watching experience. AWS does all the undifferentiated heavy lifting for Netflix. This lets Netflixians focus on providing business value.
It took more than eight years for Netflix to complete the process of moving from their own datacenters to AWS. During that period Netflix grew its number of streaming customers eightfold. Netflix now runs on several hundred thousand EC2 instances.
Netflix is More Reliable in AWS
It’s not like Netflix has never experienced down time on AWS, but on the whole, its service is much more reliable than it was before.
You don’t see complaints like this very often anymore:
Netflix is so reliable now because they’ve taken extraordinary steps to make their service reliable.
Netflix operates out of three AWS regions: one in Northern Virginia, one in Oregon, and one in Dublin, Ireland. Within each region, Netflix operates in three different availability zones.
Netflix has said there are no plans to operate out of more regions. It’s very expensive and complicated to add new regions. Most companies operate out of just one region, never mind two or three.
The advantage of having three regions is that any one region can fail, and the other regions will step in to handle all the members in the failed region. Netflix calls moving members out of a failed region evacuating a region.
Let’s use an example. Let’s say you’re watching a new House of Cards episode in London, England. Because it’s closest to London, chances are your Netflix device is connected to the Dublin region.
What happens if the entire Dublin region fails? Does that mean Netflix should stop working for you? Of course not!
Netflix, after detecting the failure, redirects you to Virginia. Your device would now talk to the Virginia region instead of Dublin. You might not even notice there was a failure.
How often does an AWS region fail? Once a month. Well, a region doesn’t actually fail every month. Netflix runs monthly tests. Every month Netflix causes a region to fail on purpose just to make sure its system can handle region level failures. A region can be evacuated in six minutes.
Netflix calls this their global services model. Any customer can be served out of any region. This is amazing. And it doesn’t happen automatically. AWS has no magic sauce for handling region failures or serving customers out of multiple regions. Netflix has done all this work on its own. Netflix is a pioneer in figuring out how to create reliable systems using multiple regions. I’m not aware of any other company that goes to these lengths to make their service so reliable.
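Here is a tiny sketch of the idea behind evacuating a region. It is not Netflix's real routing code. The latency numbers and the pick_region function are made up for illustration; only the three regions come from the text above.

```python
# Hypothetical sketch: the latencies and the function name are assumptions,
# not Netflix's actual routing logic.
REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]  # Virginia, Oregon, Dublin

# Assumed round-trip time in milliseconds from a device in London.
LATENCY_FROM_LONDON_MS = {"us-east-1": 80, "us-west-2": 140, "eu-west-1": 15}


def pick_region(latency_ms, healthy):
    """Return the lowest-latency region that is currently healthy."""
    candidates = [region for region in latency_ms if region in healthy]
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=lambda region: latency_ms[region])


healthy = set(REGIONS)
normal_region = pick_region(LATENCY_FROM_LONDON_MS, healthy)

# Simulate the Dublin region failing: members are sent somewhere else.
healthy.discard("eu-west-1")
failover_region = pick_region(LATENCY_FROM_LONDON_MS, healthy)

print(normal_region, "->", failover_region)
```

With these made-up latencies, the London viewer starts in Dublin and ends up in Virginia once Dublin is marked unhealthy, which matches the House of Cards example.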
Another advantage of being in these three regions is that it gives Netflix world-wide coverage. Netflix ran some tests and found if you use a Netflix application anywhere in the world, you’ll get fast service from one of these three regions.
Netflix Saves Money in AWS
This may surprise a lot of people, but AWS is cheaper for Netflix. The cloud costs per streaming view ended up being a fraction of the cost of its old datacenters.
Why? The elasticity of the cloud.
Netflix could add servers when it needed them and return them when it didn’t. Rather than have a lot of extra computers hanging around doing nothing just to handle peak load, Netflix only had to pay for what was needed, when it was needed.
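To see why paying only for what you use can be cheaper, here is a small worked example. Every number in it is invented for illustration (the hourly demand and the price per server are assumptions, not Netflix or AWS figures). It compares owning enough servers for the busiest hour with renting servers hour by hour.

```python
# All numbers are made up for illustration.
# Servers needed in each of the 24 hours of one day (quiet, normal, peak, normal).
HOURLY_DEMAND = [20] * 8 + [40] * 8 + [100] * 4 + [40] * 4
PRICE_CENTS_PER_SERVER_HOUR = 10  # assumed price


def fixed_capacity_cost(demand, price_cents):
    """Keep enough servers for the busiest hour and pay for them all day."""
    return max(demand) * len(demand) * price_cents


def elastic_cost(demand, price_cents):
    """Pay only for the servers actually needed in each hour."""
    return sum(demand) * price_cents


fixed = fixed_capacity_cost(HOURLY_DEMAND, PRICE_CENTS_PER_SERVER_HOUR)
elastic = elastic_cost(HOURLY_DEMAND, PRICE_CENTS_PER_SERVER_HOUR)

print("fixed capacity: $", fixed / 100)
print("elastic:        $", elastic / 100)
```

In this toy day, sizing for the peak costs $240 while the elastic approach costs $104, because most hours need far fewer servers than the peak hour does.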
This is the elasticity we talked about in the What is Cloud Computing? chapter.
What Happens in AWS Before you Press Play?
Anything that doesn’t involve serving video is handled in AWS.
This includes scalable computing, scalable storage, business logic, scalable distributed databases, big data processing and analytics, recommendations, transcoding, and hundreds of other functions.
Don’t worry, you don’t need to understand what all those things are, but since you may find it interesting, I’ll explain them briefly.
Scalable computing and scalable storage.
Scalable computing is EC2 and scalable storage is S3. Nothing new for us here.
Your Netflix device—iPhone, TV, Xbox, Android phone, tablet, etc.—talks to a Netflix service running in EC2.
View a list of potential videos to watch? That’s your Netflix device contacting a computer in EC2 to get the list.
Ask for more details about a video? That’s your Netflix device contacting a computer in EC2 to get the details.
It’s just like all the other cloud services we’ve talked about in the book.
Scalable distributed database.
Netflix uses both DynamoDB and Cassandra for their distributed databases. Not that these names should mean anything to you, they’re just high-quality database products.
Database. A database stores data. Your profile information, your billing information, all the movies you’ve ever watched, all that kind of information is stored in a database.
Distributed. Distributed means the database doesn’t run on one big computer, it runs on many computers. Your data is copied to multiple computers so if one or even two computers holding your data fail, your data will be safe. In fact, your data is copied to all three regions. That way, if a region fails your data will be there when the new region is ready to start using it.
Scalable. Scalable means the database can handle as much data as you ever want to put into it. That’s one major advantage of being a distributed database. More computers can be added as necessary to handle more data.
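If you're curious what "copied to multiple computers" looks like, here is a toy sketch. It is a simplification under stated assumptions: the Node and ReplicatedStore classes are hypothetical, and this toy copies every value to every computer, whereas real systems like Cassandra and DynamoDB spread data out and copy each piece to only some of their computers.

```python
class Node:
    """One computer holding a copy of the data (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True


class ReplicatedStore:
    """Toy key-value store that writes every value to every live node."""

    def __init__(self, node_names):
        self.nodes = [Node(name) for name in node_names]

    def put(self, key, value):
        # Copy the value to each computer that is still running.
        for node in self.nodes:
            if node.alive:
                node.data[key] = value

    def get(self, key):
        # Read from the first running computer that has the value.
        for node in self.nodes:
            if node.alive and key in node.data:
                return node.data[key]
        raise KeyError(key)


store = ReplicatedStore(["computer-a", "computer-b", "computer-c"])
store.put("profile:alice", {"plan": "standard"})

# Two of the three computers fail, but the data is still readable.
store.nodes[0].alive = False
store.nodes[1].alive = False
profile = store.get("profile:alice")
print(profile)
```

Because the profile was written to all three computers, losing two of them still leaves one good copy to read from.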
Big data processing and analytics.
Big data simply means there’s a lot of data. Netflix collects a lot of information. Netflix knows what everyone has watched, when they watched it, and where they were when they watched. Netflix knows which videos members have looked at but decided not to watch. Netflix knows how many times each video has been watched…and a lot more.
Putting all the data in a standard format is called processing.
Making sense of all that data is called analytics. Data is analyzed to answer specific questions.
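The difference between processing and analytics can be sketched in a few lines. The event shapes and field names below are invented for illustration; they are not Netflix's real data format. Processing puts messy events into one standard shape, and analytics then answers a specific question, here "how many times was each title played?"

```python
from collections import Counter

# Hypothetical raw events arriving in inconsistent shapes from different devices.
RAW_EVENTS = [
    {"user": "u1", "title": "Stranger Things", "action": "PLAY"},
    {"member_id": "u2", "video": "stranger things ", "event": "play"},
    {"user": "u3", "title": "House of Cards", "action": "browse"},
    {"member_id": "u1", "video": "House Of Cards", "event": "Play"},
]


def process(event):
    """Processing: convert one raw event into a standard format."""
    return {
        "member": event.get("user") or event.get("member_id"),
        "title": (event.get("title") or event.get("video")).strip().title(),
        "action": (event.get("action") or event.get("event")).lower(),
    }


def plays_per_title(events):
    """Analytics: count how many times each title was played."""
    return Counter(e["title"] for e in events if e["action"] == "play")


standard_events = [process(event) for event in RAW_EVENTS]
play_counts = plays_per_title(standard_events)
print(play_counts)
```

Once every event has the same fields and spelling, counting plays is a one-liner. Without the processing step, "Stranger Things" and "stranger things " would be counted as two different titles.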
Netflix personalizes artwork just for you.
Here’s a great example of how Netflix entices you to watch more videos using its data analytics capabilities.
When browsing around looking for something to watch on Netflix, have you noticed there’s always an image displayed for each video? That’s called the header image.
The header image is meant to intrigue you, to draw you into selecting a video. The idea is the more compelling the header image, the more likely you are to watch a video. And the more videos you watch, the less likely you are to unsubscribe from Netflix.
Here’s an example of different header images for Stranger Things: