I woke up today, went to check one of my websites hosted on Microsoft Azure, and found a blank page staring back at me. I ran some preliminary checks to make sure I had not broken the site overnight. Then I turned to Twitter to see whether anything was trending about Azure, and sure enough, I found reports that production services and websites hosted on Azure in the South Central US region were down. I tried to access the Azure portal, but that did not work either, so I could not try anything to remedy the problem.
Below is the latest status update from Microsoft on the issue:
Microsoft Azure – Impacted service(s)
Network Infrastructure; Azure Active Directory; SQL Database; Storage; App Service
Impacted region(s)
South Central US; Global
Last update (59 min ago)
CUSTOMER IMPACT: There are currently three identified impact workstreams:
1) Customers with resources in South Central US may experience difficulties connecting to resources hosted in this region. A complete list of impacted services can be found below.
2) Customers using non-regional services, such as Azure Active Directory, may experience intermittent authentication failures in any region.
3) Customers may encounter errors when provisioning new subscriptions.
PRELIMINARY ROOT CAUSE:
1) A severe weather event, including lightning strikes, occurred near one of the South Central US datacenters. This resulted in a power voltage increase that impacted cooling systems. Automated datacenter procedures to ensure data and hardware integrity went into effect and critical hardware entered a structured power down process.
2) As a result, non-regional services, such as Azure Active Directory, encountered an operational threshold for processing requests through the South Central US datacenter. Initial attempts to fail over into other datacenters resulted in temporary traffic congestion for those regions.
ENGINEERING STATUS: Engineers continue to implement the necessary mitigation steps. They have outlined a tentative mitigation workflow:
1) Restore power to the South Central US datacenter (COMPLETED)
2) Recover software load balancers for Azure Storage scale units in South Central US (COMPLETED)
3) Recover impacted Azure Storage scale units in South Central US (In Progress)
4) Recover the remaining Storage-dependent services in South Central US (In Progress)
This mitigation workflow is tentative and subject to change as events develop.
NEXT UPDATE: The next update will be provided by 02:00 UTC 05 Sep 2018 or as events warrant.
Finally, I thought I’d share this tweet from someone who was also affected by the outage.