Microsoft Provides the Preliminary Report on its September 4 Cloud Outage

Microsoft will be crediting customers affected by its September 4 cloud outage and will work to improve its cloud storage resiliency, according to its preliminary analysis of the incident.

Microsoft has publicly released a preliminary root cause analysis (RCA) for its September 4 cloud outage that impacted customers worldwide. The Azure engineering teams are continuing to investigate the incident and say they will provide a more detailed analysis "in the weeks ahead."

Impacted customers will receive a credit based on the Microsoft Azure Service Level Agreement in their October billing statements, Microsoft officials said in the post-mortem report.

On September 4, as I blogged originally, a lightning strike hit near Microsoft's South Central US datacenter region, knocking out a number of Azure services, as well as Office 365, which authenticates via Azure Active Directory, for many Microsoft customers worldwide.

Microsoft's post-mortem summary noted that the storm caused "electrical activity on the utility supply, which caused significant voltage swells." These swells caused part of one Azure datacenter to transfer to generator power and shut down the datacenter's cooling systems, even though there were surge suppressors in place. The datacenter still maintained required operational temperatures through a load-dependent thermal buffer in the cooling system, but once that buffer was depleted, temperatures went up and an automated shutdown of devices was initiated.

Some hardware was damaged before it could shut down, including a "significant number of storage servers" and other network devices and power units. Onsite teams began attempts to recover the infrastructure, which meant replacing failed hardware, migrating data from damaged servers to healthy ones and validating that data wasn't corrupted.

For those wondering why Microsoft's datacenter didn't fail over to a backup site: "The decision was made to work towards recovery of data and not fail over to another datacenter since a failover would have resulted in limited data loss due to the asynchronous nature of geo-replication," officials explained in the post.
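
To make that trade-off concrete, here is a minimal sketch of why failing over under asynchronous replication can lose data. It is a toy model, not Azure's implementation: the class `AsyncGeoReplicatedStore` and its methods are hypothetical names invented for illustration. The assumption is simply that a write is acknowledged once the primary site has it, and is shipped to the secondary site some time later.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """One datacenter's copy of the data (hypothetical, for illustration)."""
    name: str
    writes: list = field(default_factory=list)


class AsyncGeoReplicatedStore:
    """Toy model of asynchronous geo-replication between two sites."""

    def __init__(self):
        self.primary = Site("primary")
        self.secondary = Site("secondary")
        # Writes acknowledged to the client but not yet copied to the secondary.
        self._pending = []

    def write(self, record):
        # The write is acknowledged as soon as the primary stores it;
        # the secondary does not have it yet.
        self.primary.writes.append(record)
        self._pending.append(record)

    def replicate(self, max_records=None):
        # Ship up to max_records pending writes to the secondary, in order.
        n = len(self._pending) if max_records is None else max_records
        shipped, self._pending = self._pending[:n], self._pending[n:]
        self.secondary.writes.extend(shipped)

    def fail_over(self):
        # Promote the secondary. Anything still pending never reached it,
        # so those acknowledged writes are lost from the customer's view.
        lost = list(self._pending)
        self._pending.clear()
        self.primary, self.secondary = self.secondary, self.primary
        return lost
```

In this sketch, if five records are written and only three have been replicated when `fail_over()` is called, the promoted site holds three records and the last two are gone. Recovering the original site instead, as Microsoft chose to do, avoids that loss at the cost of a longer outage.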

The shutdown of the datacenter impacted many Azure services that depended on the storage servers in that datacenter. Among the services hit: Storage, Virtual Machines, Application Insights, Cognitive Services & Custom Vision API, Backup, App Service (and App Services for Linux and Web App for Containers), Azure Database for MySQL, SQL Database, Azure Automation, Site Recovery, Redis Cache, Cosmos DB, Stream Analytics, Media Services, Azure Resource Manager, Azure VPN gateways, PostgreSQL, Azure Machine Learning Studio, Azure Search, Data Factory, HDInsight, IoT Hub, Analysis Services, Key Vault, Log Analytics, Azure Monitor, Azure Scheduler, Logic Apps, Databricks, ExpressRoute, Container Registry, Application Gateway, Service Bus, Event Hub, Azure Portal IaaS Experiences, Bot Service, Azure Batch, Service Fabric and Visual Studio Team Services (VSTS).

Microsoft says "the vast majority of these services were mitigated by 11:00 UTC on September 5," but acknowledges full mitigation didn't happen until 8:40 on September 7.

Why were customers outside the South Central US region also affected by this series of events? According to the post, there was "insufficient resiliency for Azure Service Manager," the operations-management service for "classic" resource types. "Although ASM is a global service, it does not support automatic failover," Microsoft execs said. Azure Resource Manager services outside the South Central US region were also impacted because of various dependencies on ASM and other related services.

Azure Active Directory was also impacted, officials said, because authentication traffic from the shut-down datacenter was routed to other sites at the same time that the rate of authentication requests increased. The post details what went wrong with VSTS, Azure Application Insights, and other key services during that series of events in early September.

Microsoft execs apologized to affected customers and said they are looking for ways to improve architectural resiliency after this event. The company is doing a detailed forensic analysis of the impacted datacenter hardware and systems; a review of every internal service with dependencies on the Azure Service Manager; an investigation of the possibility of moving these ASM-dependent services to Azure Resource Manager; and an evaluation of future hardware design of storage units to increase resiliency.
