When Things Go Wrong in the Cloud, It’s Hard to Know Where to Look for Help

Azure Active Directory Errors Stop Office 365 Dead

Office 365 has had a good run in terms of reliability, and we might all have become too accustomed to assuming that the cloud runs without a hitch. That idea was challenged on September 4 when, starting at 9:09 UTC (according to Office 365) or 9:29 UTC (according to Azure), a problem developed in the South Central U.S. datacenter region. According to a status update posted around 18:00 UTC, lightning strikes during severe weather caused a voltage increase that led to a cooling problem in the datacenter. The resulting temperature spike invoked automated datacenter procedures to power down equipment to protect data and hardware.

Early on in the incident, the Azure status report noted that “some services may also be experiencing intermittent authentication issues due to downstream Azure Active Directory impact...” The intermittent nature of the problem seemed to be due to overload conditions that occurred when Microsoft rerouted traffic to other datacenters. According to Microsoft’s Alex Simons (VP of program management for the identity division), availability for U.S. tenants dropped to 70% due to the load. Simons said that tenants outside the U.S. were unaffected, but in fact some of them were affected.

Azure status says AAD has a problem

Frustratingly, the Azure status page returned many HTTP 500 (internal server error) and 502 (bad gateway) errors during the outage, probably due to customer demand for updates.
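
If you script your own checks against a status page, transient 5xx responses like these are usually better handled by retrying with growing delays than by hammering an already overloaded site. Here is a minimal Python sketch of that idea. The `fetch` callable, the set of codes treated as transient, and the delay values are all assumptions made for illustration; nothing here comes from Microsoft.

```python
import time
from http import HTTPStatus

# Codes treated as transient in this sketch (an assumption, not a standard list).
TRANSIENT = {500, 502, 503, 504}


def describe(status):
    """Return the standard reason phrase for a status code, e.g. 502 -> 'Bad Gateway'."""
    return HTTPStatus(status).phrase


def backoff_delays(max_attempts, base_delay=1.0, cap=60.0):
    """Delays to wait after each failed attempt except the last: base, 2*base, 4*base... up to cap."""
    return [min(cap, base_delay * 2 ** i) for i in range(max_attempts - 1)]


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-transient status or attempts run out.

    fetch is a hypothetical callable returning (status_code, body); in real use it
    would wrap an HTTP request to the status page.
    """
    delays = backoff_delays(max_attempts, base_delay)
    for attempt in range(max_attempts):
        status, body = fetch()
        if status not in TRANSIENT:
            return status, body
        if attempt < max_attempts - 1:
            sleep(delays[attempt])  # wait longer after each consecutive failure
    return status, body  # still failing: hand the last response back to the caller
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting. A production version would probably add random jitter so that many clients don't retry in lockstep.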

Microsoft 365 Communications Can Improve

Communication with Office 365 tenants wasn’t as smooth as it could and should have been either. Although the Office 365 status page (used to communicate with customers when the Admin portal might be offline) stayed online, it gave far less information than was available on the Azure status page. The difference in communication didn’t give a sense of a truly connected service.

Office 365 Status

Effects from a U.S. Datacenter Failure Spread to Europe

Later, the Office 365 update changed to say that the scope of the impact was:

“This issue could potentially affect any of your users who are hosted out of the San Antonio data center. Impact is specific to a subset of users who are served through the affected infrastructure.”

My Office 365 tenant is in the EMEA datacenter region and San Antonio is in Texas, yet I suffered authentication issues. The problem was more apparent with Outlook than with browser apps, which made me think that the problem was associated with some component located outside EMEA. The first clue was that the MFA secondary authentication (in my case, entering a number sent by SMS) failed every time Outlook tried to connect. On the other hand, OWA, Teams, SharePoint Online, OneDrive for Business, and Planner all worked for me. Others experienced problems.

Further investigation revealed that United States-based datacenters deliver parts of Azure Active Directory to EMEA Office 365 tenants. The U.S. handles some of the authentication workload for EMEA, including MFA. Here’s the relevant text from Microsoft’s documentation:

Two-factor authentication and its related personal data might be stored in the U.S. if you’re using MFA or SSPR.

  • All two-factor authentication using phone calls or SMS might be completed by U.S. carriers.
  • Push notifications using the Microsoft Authenticator app require notifications from the manufacturer’s notification service (Apple or Google), which might be outside Europe.
  • OATH codes are always validated in the U.S.

This accounts for why some services ran without a problem and some failed. Other Azure-based services linked to AAD are also resident in the U.S. and these also ran into problems.
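
One way to reason about this kind of cross-region exposure is to write down which regions each sign-in path relies on and then ask what breaks when a region degrades. The Python sketch below is a toy model only. The path names and region assignments are illustrative assumptions loosely based on the text quoted above, not Microsoft's actual topology.

```python
# Toy dependency map: each sign-in path lists the regions it relies on.
# Every entry is an illustrative assumption, not a statement about real infrastructure.
DEPENDENCIES = {
    "password sign-in (EMEA tenant)": {"EMEA"},
    "MFA via SMS or phone call": {"EMEA", "US"},
    "MFA via OATH code": {"EMEA", "US"},
}


def affected(dependencies, degraded_regions):
    """Return the names of paths that rely on at least one degraded region, sorted."""
    return sorted(
        name
        for name, regions in dependencies.items()
        if regions & degraded_regions  # non-empty intersection means exposure
    )


# Which paths are exposed if the hypothetical "US" region degrades?
print(affected(DEPENDENCIES, {"US"}))
```

In this model, an EMEA tenant's password sign-in survives a U.S. problem while both MFA paths are exposed, which matches the pattern of some services working and others failing.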

What the Office 365 Admin Center Said

The information available in the Office 365 Admin Center wasn’t any better. Indeed, this report cited a different start time (12:52 UTC). The Admin Center gradually added information about services that might be affected, including Exchange, Power BI, SharePoint, Teams, and Intune.

Office 365 Admin Center reports the same problem

Twitter

The official Office 365 status Twitter account didn’t help much either as all it did was refer people to the Office 365 Admin Center or status.office.com, much to the frustration of people, some of whom couldn’t access the Admin Center because of the authentication problems.

Guidance from Microsoft on Twitter

Overall, the communications channels within the different strands of Microsoft 365 didn’t line up well on this occasion. It took several hours before Microsoft was able to communicate what was going on.

Problems Ease

The effects of the outage eased for me from about 15:30 UTC. However, I heard from other EMEA tenants that they experienced problems for longer. This might have been due to peaks in demand caused when users found they could authenticate again.

At the time of writing (8:30 UTC Sept 5), the status shows that Microsoft’s steps to restore service have succeeded and they are validating that full recovery has occurred.

[Update: Some issues were experienced as U.S. users came online on September 5 and load increased on the Azure datacenters. At 18:40 UTC, things seem to have settled down again as mitigation steps take effect.]

Moving Infrastructure is Hard

I can’t imagine what it must be like to have to restore power to and reset parts of a massive infrastructure like Office 365 and Azure while also trying to cope with the volume of user traffic that keeps on flowing into a datacenter. Remember, Office 365 has 135 million monthly active users, and this problem happened at peak time in Europe and extended into peak time in the U.S. (just after people returned from the Labor Day holiday too).

Azure and Office 365 are Closely Connected

The incident is a reminder of the complex and interconnected nature of the Azure and Office 365 infrastructure and how big a dependency Office 365 has on Azure Active Directory. We’ve experienced this before, in June 2014 and December 2015, when Azure Active Directory outages affected services in the U.S. and Europe for many hours.

The cloud world is built from multiple services. Incidents like this show how problems ripple across datacenters to affect components in ways that you might not imagine. In my case, a problem in a U.S. datacenter reached across the Atlantic and made my day worse than it needed to be, as it did for many other Office 365 users in the U.S., Mexico, and Europe. But that’s life in the cloud. I guess I should be used to it now.

[Here’s an article by Aidan Finn that assesses the outage from an Azure perspective]

It’s important to emphasize that outages are a fact of IT life. Outages for cloud services occur less often than in most on-premises environments but when they happen, they affect more people. That’s another fact of cloud life. Another thing to remember is that the vast majority of Office 365 tenants remained blissfully unaware that bad weather caused cooling problems for a Texan datacenter.

You can’t do much when an outage happens except rely on the cloud provider to restore service as soon as they can, but outages don’t make a cloud service any more or less desirable. You accept the ups and downs of whatever platform you decide to use for your work. Office 365 and Azure have good reliability records and comfortably meet their SLA targets. This incident should make Microsoft think about how to better reroute workload after lightning strikes on datacenters and how to communicate with customers in a more coherent and comprehensive manner. Apart from that, it won’t change anything.

Root Cause Analysis

Update: In mid-September, Microsoft published their preliminary root cause analysis on the Azure Status History page. Basically, their datacenter processes were overwhelmed by a sudden increase in temperatures that forced hardware to go offline in an uncontrolled manner. The resulting corruption caused problems getting services back online, and some of those services affected users outside the U.S. Microsoft’s summary is posted below:

Summary of impact: In the early morning of September 4, 2018, high energy storms hit southern Texas in the vicinity of Microsoft Azure’s South Central US region. Multiple Azure datacenters in the region saw voltage sags and swells across the utility feeds. At 08:42 UTC, lightning caused electrical activity on the utility supply, which caused significant voltage swells.  These swells triggered a portion of one Azure datacenter to transfer from utility power to generator power. Additionally, these power swells shutdown the datacenter’s mechanical cooling systems despite having surge suppressors in place. Initially, the datacenter was able to maintain its operational temperatures through a load dependent thermal buffer that was designed within the cooling system. However, once this thermal buffer was depleted the datacenter temperature exceeded safe operational thresholds, and an automated shutdown of devices was initiated. This shutdown mechanism is intended to preserve infrastructure and data integrity, but in this instance, temperatures increased so quickly in parts of the datacenter that some hardware was damaged before it could shut down. A significant number of storage servers were damaged, as well as a small number of network devices and power units.
While storms were still active in the area, onsite teams took a series of actions to prevent further damage – including transferring the rest of the datacenter to generators thereby stabilizing the power supply. To initiate the recovery of infrastructure, the first step was to recover the Azure Software Load Balancers (SLBs) for storage scale units. SLB services are critical in the Azure networking stack, managing the routing of both customer and platform service traffic. The second step was to recover the storage servers and the data on these servers. This involved replacing failed infrastructure components, migrating customer data from the damaged servers to healthy servers, and validating that none of the recovered data was corrupted. This process took time due to the number of servers damaged, and the need to work carefully to maintain customer data integrity above all else. The decision was made to work towards recovery of data and not fail over to another datacenter, since a fail over would have resulted in limited data loss due to the asynchronous nature of geo replication.
Despite onsite redundancies, there are scenarios in which a datacenter cooling failure can impact customer workloads in the affected datacenter. Unfortunately, this particular set of issues also caused a cascading impact to services outside of the region.
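
The point in the summary about asynchronous geo-replication is worth a small illustration. With asynchronous replication, the primary acknowledges a write before the copy reaches the secondary, so a failover loses whatever was still in flight. The Python sketch below models that with a fixed replication lag. The `Write` type, the fixed lag, and the numbers in the example are simplifying assumptions for illustration, not a description of how Azure Storage works internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    key: str
    committed_at: float  # seconds; when the primary acknowledged the write


def lost_on_failover(writes, failure_time, replication_lag):
    """Return writes the primary acknowledged before failing that never reached the secondary.

    Assumes a fixed lag: a write committed at time t arrives at the secondary
    at t + replication_lag (a simplification of real replication behavior).
    """
    return [
        w
        for w in writes
        if w.committed_at <= failure_time
        and w.committed_at + replication_lag > failure_time
    ]


# Hypothetical example: primary fails at t=10s with a 5s replication lag.
writes = [Write("a", 0.0), Write("b", 8.0), Write("c", 9.5)]
print([w.key for w in lost_on_failover(writes, failure_time=10.0, replication_lag=5.0)])
```

In the example, writes `b` and `c` were acknowledged but not yet replicated when the primary failed, so a failover would discard them. Recovering the original servers, as Microsoft chose to do, avoids that loss at the cost of a longer outage.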

Office 365 for IT Pros can’t solve datacenter problems, but we can give you something useful to read while you’re waiting for the next update from Microsoft.
