Distributed Learning Technologies

Service Disruption 1-9-2019

Reported: January 9th, 2019 9:04 AM PT

Resolved: January 9th, 2019 10:50 AM PT

Symptom

Users attempting to reach the zoom.us website experienced slow connections or were unable to load pages. In addition, users attempting to join Zoom meetings or webinars were unable to connect.

Root Cause

To scale globally, Zoom has developed a geographically distributed platform for delivery of its services, housed across Zoom’s 13 data centers and within well-established cloud service providers such as Amazon. All real-time media, such as voice and video, is handled through Zoom’s data centers, while data storage and services such as the Zoom website are hosted with a cloud service provider.
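
To make this split concrete, the hypothetical sketch below routes real-time media traffic to a Zoom-operated data center and website or data-layer traffic to a cloud provider region. The site names, region names, and routing function are illustrative assumptions, not Zoom's implementation.

    # Hypothetical sketch, not Zoom's actual code: real-time media is routed to a
    # Zoom-operated data center, while web and data-layer traffic is served from a
    # cloud provider region, mirroring the split described above.

    ZOOM_MEDIA_DATACENTERS = ["dc-east", "dc-west", "dc-europe"]   # illustrative site names
    CLOUD_WEB_REGIONS = ["us-east-1", "us-west-1"]                 # illustrative AWS regions

    def route_request(traffic_class: str, index: int = 0) -> str:
        """Return a destination for a request based on its traffic class."""
        if traffic_class in ("audio", "video", "screen_share"):
            # Real-time media stays on Zoom's own data centers.
            return ZOOM_MEDIA_DATACENTERS[index % len(ZOOM_MEDIA_DATACENTERS)]
        # Website and data-layer traffic is served from a cloud provider region.
        return CLOUD_WEB_REGIONS[index % len(CLOUD_WEB_REGIONS)]

    print(route_request("video"))    # a Zoom data center
    print(route_request("website"))  # a cloud provider region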


At 9:04 AM PT, Zoom’s internal monitoring tools observed an issue at the data layer, exhibited as a high connection load on a key database within the primary Virginia region in Amazon Web Services (AWS). Based on this monitoring, an automated failover to a secondary database within the same region but in a different zone was executed in an attempt to restore service. However, the backup database exhibited the same behavior and was also flagged by the monitoring system, triggering a higher-level failover to another backup system in a geographically separate AWS region located in California.

While the automated failover routine executed, Zoom’s operations engineers began working with Amazon to understand the trigger for this behavior pattern, which was identified as a hardware failure of the underlying storage supporting the backup database in this high availability pair. Unfortunately, both nodes of the pair failed simultaneously in connection with the event, making Amazon’s highly available service offering unavailable for Zoom’s instance at that time. Though this database failure caused issues within the data layer of services, platform components that did not need to make requests to this portion of the data layer were not impacted. For example, Zoom’s chat services and existing meetings and webinars (sessions that had started before 9:04 AM) remained unaffected.
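
The escalation path described above can be pictured with the following sketch. It is an illustrative reconstruction rather than Zoom's monitoring or failover tooling; the threshold, data structures, and names are assumptions. A database flagged for high connection load is first replaced by a standby in another zone of the same region and, if that standby is also flagged, by a standby in a geographically separate region.

    # Hypothetical sketch, not Zoom's tooling: escalating automated failover driven
    # by a connection-load monitor. The first healthy candidate in escalation order
    # (primary, same-region standby, cross-region standby) is chosen.

    CONNECTION_LOAD_THRESHOLD = 0.9   # assumed threshold, as a fraction of capacity

    def is_overloaded(db: dict) -> bool:
        """Flag a database whose connection load exceeds the assumed threshold."""
        return db["connection_load"] > CONNECTION_LOAD_THRESHOLD

    def failover_target(primary: dict, zone_standby: dict, region_standby: dict):
        """Return the name of the first healthy target, or None if all are flagged."""
        for candidate in (primary, zone_standby, region_standby):
            if not is_overloaded(candidate):
                return candidate["name"]
        return None

    primary = {"name": "virginia-primary", "connection_load": 0.98}
    zone_standby = {"name": "virginia-standby", "connection_load": 0.97}     # same failure pattern
    region_standby = {"name": "california-standby", "connection_load": 0.35}

    print(failover_target(primary, zone_standby, region_standby))  # california-standby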

The failover to the backup AWS location in California was completed by 9:13 AM PT. Monitoring services concluded that database services were operational and stable at that location. However, it was found that the California location was experiencing delays in the local instance of a Zoom platform component. Since all AWS services were stable in California, Zoom’s operations and software engineers immediately began troubleshooting to determine the source of the performance issue. While troubleshooting that service, the Zoom operations team was informed that the high availability pair in Virginia had been restored, and at 9:38 AM PT failed back over to Virginia in an attempt to resolve the issue. Unfortunately, while the database was available, instability was observed. As before, services that did not need to make requests to this portion of the data infrastructure remained unimpacted.

Unable to restore service in Virginia (due to instability of the AWS database) or California (due to performance degradation of a Zoom component), at 9:53 AM PT the Zoom operations team activated a failover to a Zoom data center in New York and a special failover service designed for this type of scenario. This failover service was designed not to rely on data services within AWS and to enable most functionality of the platform. However, the service quickly began to exhibit poor performance under the high load, and by 10:01 AM PT the operations team initiated a failover back to AWS in California.
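
The degraded mode such a service provides can be illustrated with a short sketch. The replica contents, field names, and disabled features below are hypothetical; the point is only that a request is answered without any call to the cloud data layer, with dependent features switched off.

    # Hypothetical sketch, not Zoom's failover service: meeting-join requests are
    # answered from a local replica instead of the cloud data layer, and features
    # backed by that layer are disabled.

    LOCAL_MEETING_REPLICA = {
        "123-456-789": {"host": "example-host", "topic": "Weekly sync"},
    }

    def handle_join(meeting_id: str) -> dict:
        """Serve a join request without touching the cloud data layer."""
        record = LOCAL_MEETING_REPLICA.get(meeting_id)
        if record is None:
            return {"status": "unknown_meeting"}
        return {
            "status": "ok",
            "meeting": record,
            # Features that depend on the unavailable data layer are switched off.
            "disabled_features": ["cloud_recording", "usage_reports"],
        }

    print(handle_join("123-456-789"))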

Zoom’s operations team then worked on three options in parallel to determine the quickest course to resolution: stabilizing the database in Virginia, optimizing the performance of the Zoom platform component in California, or adding resources to the service in New York.


Solution

Once the hardware issue with the backup database in the Virginia high availability pair was addressed and the database was allowed to ramp up to the load, service was fully restored using the AWS Virginia location, and all services were returned to that location by 10:50 AM PT. The performance issue within the California location was then isolated, at 12:30 PM PT, to an incorrect configuration setting on Zoom’s side, which was subsequently changed to the proper value.

Prevention

First, Zoom will implement a daily end-to-end test of all failover facilities and services to ensure that every failover option is exercised regularly and available if needed. Next, Zoom’s operations team will review the current architecture and implementation, including the current design and the use of vendors to provide various services. Additionally, Zoom will perform more exhaustive platform testing to ensure that failures in specific portions of the service do not cause operational impact to other components.
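
A daily end-to-end failover drill of the kind described above might look roughly like the sketch below. The endpoints, timeout, and structure are assumptions made for illustration, not Zoom's actual test harness.

    # Hypothetical sketch, not Zoom's test harness: a daily check that each
    # failover site can answer a canary health request within a time budget,
    # failing loudly if any site cannot.

    import urllib.request

    FAILOVER_SITES = [
        "https://failover-va.example.com/health",
        "https://failover-ca.example.com/health",
        "https://failover-ny.example.com/health",
    ]   # illustrative endpoints, not real Zoom URLs

    TIMEOUT_SECONDS = 5   # assumed per-site budget

    def check_site(url: str) -> bool:
        """Return True if the site answers the canary request in time."""
        try:
            with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as response:
                return response.status == 200
        except OSError:
            return False

    def run_daily_drill() -> None:
        """Raise if any failover site fails its canary check."""
        failures = [url for url in FAILOVER_SITES if not check_site(url)]
        if failures:
            raise RuntimeError(f"failover drill failed for: {failures}")

    if __name__ == "__main__":
        run_daily_drill()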