In the winter of 2012, Netflix suffered a seven-hour outage[1] caused by problems in the AWS Elastic Load Balancer service in the US-East region. (Netflix runs on Amazon Web Services [AWS]; we don't have any data centers of our own. All of your interactions with Netflix are served from AWS, except the actual streaming of the video: once you click "play," the video files are served from our own CDN.) During the outage, none of the traffic going into US-East was reaching our services.
To prevent this from happening again, we decided to build a system of regional failovers that is resilient to failures of our underlying service providers. Failover is a method of protecting computer systems from failure in which standby equipment automatically takes over when the main system fails.
Regional failovers decreased the risk
We expanded to a total of three AWS regions: two in the United States (US-East and US-West) and one in the European Union (EU). We reserved enough capacity to perform a failover, so that we could absorb an outage of any single region.
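To make the capacity reservation concrete, here is a rough sketch of the arithmetic. It assumes the failed region's peak load is split evenly across the surviving regions, which is a simplification for illustration and not necessarily how the traffic was actually divided. The function names and the load figures in the usage note are hypothetical.

```python
def required_capacity(peak_load: dict[str, float], failed: str) -> dict[str, float]:
    """Capacity each surviving region needs if `failed` goes down.

    Assumption: the failed region's peak load is split evenly
    across all surviving regions.
    """
    survivors = [region for region in peak_load if region != failed]
    if not survivors:
        raise ValueError("no surviving regions to absorb the load")
    extra = peak_load[failed] / len(survivors)
    return {region: peak_load[region] + extra for region in survivors}


def reserve_for_any_single_outage(peak_load: dict[str, float]) -> dict[str, float]:
    """Worst-case capacity per region over every possible single-region outage."""
    reserve: dict[str, float] = {}
    for failed in peak_load:
        for region, need in required_capacity(peak_load, failed).items():
            reserve[region] = max(reserve.get(region, 0.0), need)
    return reserve
```

With made-up peak loads of 100, 60, and 40 units for US-East, US-West, and EU, this would reserve 130, 110, and 90 units respectively, since each region has to be sized for the worst outage among the other two.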
A typical failover looks like this:
- Realize that one of the regions is having trouble.
- Scale up the two savior regions.
- Proxy some traffic from the troubled region to the saviors.
- Change DNS away from the problem region to the savior regions.
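The ordering of the steps above can be sketched as a small orchestration function. This is only an illustration: `scale_up`, `proxy_traffic`, and `switch_dns` are hypothetical callables standing in for whatever tooling performs each step, and the detection step is assumed to have already picked the troubled region.

```python
from typing import Callable, Sequence


def run_failover(
    troubled: str,
    all_regions: Sequence[str],
    scale_up: Callable[[str], None],
    proxy_traffic: Callable[[str, list[str]], None],
    switch_dns: Callable[[str, list[str]], None],
) -> list[str]:
    """Run the failover steps in order and return the savior regions.

    All three callables are hypothetical hooks supplied by the caller.
    """
    saviors = [region for region in all_regions if region != troubled]
    if not saviors:
        raise ValueError("no savior regions available")

    # Scale up the saviors first so they have capacity before traffic arrives.
    for region in saviors:
        scale_up(region)

    # Proxy some traffic from the troubled region to the saviors.
    proxy_traffic(troubled, saviors)

    # Finally, change DNS away from the troubled region.
    switch_dns(troubled, saviors)
    return saviors
```

Passing the actions in as callables keeps the sequence itself easy to exercise in a dry run with stubs that only record what would have happened.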
1. Identify the trouble
We need metrics, and preferably a single metric, that can tell us the health of the system. At Netflix, we use a business metric called stream starts per second (SPS for short). This is the number of clients per second that successfully start streaming a show.
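As a rough illustration of using SPS as a single health signal, the sketch below computes the rate from a list of stream-start timestamps and compares it with an expected baseline. The function names, the idea of a fixed expected value, and the `min_ratio` threshold of 0.5 are all assumptions made for this example, not a description of the actual detection logic.

```python
def stream_starts_per_second(
    start_timestamps: list[float], window_start: float, window_seconds: float
) -> float:
    """SPS over the half-open window [window_start, window_start + window_seconds)."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    window_end = window_start + window_seconds
    count = sum(1 for t in start_timestamps if window_start <= t < window_end)
    return count / window_seconds


def region_in_trouble(
    observed_sps: float, expected_sps: float, min_ratio: float = 0.5
) -> bool:
    """Flag a region when observed SPS falls below a fraction of the expected SPS.

    `min_ratio` is a hypothetical threshold chosen for illustration.
    """
    if expected_sps <= 0:
        return False  # no baseline to compare against
    return observed_sps / expected_sps < min_ratio
```

For example, four stream starts in a two-second window give an SPS of 2.0; against an expected 10.0 that ratio is 0.2, so the region would be flagged under the assumed threshold.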
We have this data partitioned