Azure Active Directory and Office 365
In December 2015, I wrote an article asking if Azure Active Directory was becoming the Achilles' heel of Office 365. The article followed a significant Exchange Online outage in Western Europe, similar in some ways to the events of January 24-25. Users couldn't connect to their mailboxes, and those who could connect experienced latency or slowness.
Unhappy Run of Events
Azure Active Directory has been having an unhappy run of outages recently, notably in September 2018 when a lightning strike in Texas caused issues for many services and in November 2018 when a problem with the Multi-Factor Authentication service made it unavailable in multiple Office 365 regions. Another authentication outage flowing from Azure Active Directory problems happened on January 29. It’s an unfortunate run of problems that underlines the truth that if you can’t authenticate, you can’t connect, and you can’t work.
Things don't seem to have changed much since 2015, but had you asked the question in August 2018, after several months of calm, you might have received a different answer. Clouds change quickly and can turn on you…
What Happened in the Incidents
Following the January 24-25 problems, Microsoft issued a Post Incident Report (PIR) to explain what happened in two separate but conjoined incidents (EX172564 and EX172491). The first was a capacity failure that left approximately 1% of the users served by the Office 365 EMEA datacenter region unable to connect to their mailboxes. The other fault affected approximately 10% of the users, but not as seriously.
The figures are from the PIR. After years of monitoring Office 365, Microsoft’s telemetry is well developed, and I am inclined to accept their data. Based on the stream of outage reports that flowed in, you might have thought that much more than 1% of users were affected. This reflects the natural inclination of people who are affected to protest while the majority who aren’t affected stay silent (they’re working).
The stated root cause is that a Windows Server component that handles User Datagram Protocol (UDP) transactions held a kernel lock for an extended period, causing domain controllers to crash. The resulting load then caused problems for the remaining domain controllers because the reduced pool of available controllers couldn’t handle the demand on the system.
All systems can experience problems if available capacity is reduced below the level of user demand. The PIR says that Microsoft is conducting an architectural review to understand if they need to deploy extra scalability and resiliency options. They’re also looking at the way the automated recovery worked inside Office 365 when a situation like this happens so the processes work better in the future.
I guess what happened is a unique condition that Microsoft had not designed for. What’s bad about this situation is that Azure Active Directory’s weakness in handling the spikes in load caused when capacity drops continues to be a concern. Given how essential Azure Active Directory is to the Office 365 ecosystem, it seems like Microsoft could do more to manage spikes when things go wrong.
What Went Right
On the upside, the segmentation of resources inside Office 365 limited the effect of the problem. Instead of all European users being affected, only those users whose accounts were in the forests served by the failed domain controllers had a problem. If your account was in another forest (like mine), you didn’t have a problem. This is an example of how not putting all your eggs in one proverbial basket really is a good idea.
Another positive is the speed at which the engineers responded to the outage, read the telemetry, understood the problem, and responded with fixes. Sure, we’d all like DevOps to be even faster, but this looks as if the model worked.
It’s obvious that the telemetry and data available for debugging problems are much broader and deeper than inside most on-premises deployments. But that’s how it should be, as otherwise managing the 175,000-plus mailbox servers inside Exchange Online would be nigh-on impossible.
Not Doom and Gloom
I’m sure that the folks who sell products to help Office 365 tenants cope with cloud failures will seize on these outages to drive home their point that Office 365 is fallible. And they’re right. All cloud services are fallible. Anything can happen, from the client workstation to the internet connection to DNS to a failure inside Microsoft.
In fact, failures happen all the time. But in most cases, the segmentation of Office 365 into regions, datacenters, and even Database Availability Groups lessens a failure’s potential to spread. The MFA outage in November is a notable exception, where a single point of failure caused problems across multiple regions.
Hope for the Future
Azure Active Directory has had a bad run. Let’s hope that stability is restored and the next few months are quiet. In the interim, DownDetector.com is a good place to check if you think problems are brewing, and if you use Twitter, follow the Microsoft 365 status account to get live updates. And of course, we’ll keep an eye on things here!
For more information about how to cope with Office 365 outages, read Chapter 4 of the Office 365 for IT Pros eBook.