Updated November 22 with Microsoft’s preliminary root cause analysis and again on November 26 with the final version.
Two Outages in Ten Weeks
The second major outage for Azure multi-factor authentication (MFA) in two months brought some Office 365 users to a halt on Monday, November 19. That is, until administrators understood what was happening and disabled MFA for accounts to allow users to sign in.
The first outage occurred on September 4 when lightning struck Microsoft’s San Antonio datacenter. Postmortems published after the event (here’s the VSTS version) revealed how the impact of the outage rippled across multiple Microsoft cloud services, including MFA.
The original problem statement was:
“Customers in Europe, Asia-Pacific and the Americas regions may experience difficulties signing into Azure resources, such as Azure Active Directory, when Multi-Factor Authentication is required by policy.”
The official word on the Azure incident history page says:
“A recent update was deployed to improve connections to caching services for the MFA service, this introduced a race condition which prevented users from being able to sign-in, or carry-out self-service password resets, when using MFA services…
Engineers initially rolled back the deployment which eliminated the connection between the Azure MFA service and the backend caching service. Engineers subsequently cycled impacted servers which allowed authentication requests to succeed.”
On November 26, Microsoft updated the root cause analysis to say:
“There were three independent root causes discovered. In addition, gaps in telemetry and monitoring for the MFA services delayed the identification and understanding of these root causes which caused an extended mitigation time.
The first two root causes were identified as issues on the MFA frontend server, both introduced in a roll-out of a code update that began in some datacenters (DCs) on Tuesday, 13 November 2018 and completed in all DCs by Friday, 16 November 2018. The issues were later determined to be activated once a certain traffic threshold was exceeded which occurred for the first time early Monday (UTC) in the Azure West Europe (EU) DCs. Morning peak traffic characteristics in the West EU DCs were the first to cross the threshold that triggered the bug. The third root cause was not introduced in this rollout and was found as part of the investigation into this event.
1. The first root cause manifested as latency issue in the MFA frontend’s communication to its cache services. This issue began under high load once a certain traffic threshold was reached. Once the MFA services experienced this first issue, they became more likely to trigger second root cause.
2. The second root cause is a race condition in processing responses from the MFA backend server that led to recycles of the MFA frontend server processes which can trigger additional latency and the third root cause (below) on the MFA backend.
3. The third identified root cause, was previously undetected issue in the backend MFA server that was triggered by the second root cause. This issue causes accumulation of processes on the MFA backend leading to resource exhaustion on the backend at which point it was unable to process any further requests from the MFA frontend while otherwise appearing healthy in our monitoring.”
The worrying part of the story is that a code update proved unreliable once it reached production, which reflects poorly on the quality control and testing regimes for Microsoft’s cloud services.
The incident started at 04:39 UTC and stopped users from completing the MFA secondary challenge and signing into services like Office 365. For instance, the text message containing the code to prove that the account owner has the device registered for the account never arrived, meaning that the challenge shown below could never be completed.
According to Microsoft, the problem started to ease at around 14:45 UTC after a hotfix was deployed. It takes a long time to deploy code fixes across a massive infrastructure and many tenants were affected by the problem for several hours afterwards. I first managed to authenticate with MFA at 18:23 UTC. Others were not so lucky and the lack of connectivity persisted for several hours afterwards. The incident slowly wound down and, at the time of writing, the situation is being monitored by Microsoft but everything is working.
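To put the timeline in perspective, here is a small sketch that works out the elapsed time from the figures quoted above. The variable names are illustrative, and the 14:45 UTC figure is Microsoft's approximate time for when the problem started to ease, so the results are approximate too.

```python
from datetime import datetime, timezone

# Times reported in this article for Monday, November 19, 2018 (all UTC).
incident_start = datetime(2018, 11, 19, 4, 39, tzinfo=timezone.utc)
hotfix_easing = datetime(2018, 11, 19, 14, 45, tzinfo=timezone.utc)   # "around 14:45"
author_sign_in = datetime(2018, 11, 19, 18, 23, tzinfo=timezone.utc)


def hours_minutes(start, end):
    """Return the gap between two datetimes as an (hours, minutes) tuple."""
    total_minutes = int((end - start).total_seconds() // 60)
    return divmod(total_minutes, 60)


until_easing = hours_minutes(incident_start, hotfix_easing)
until_sign_in = hours_minutes(incident_start, author_sign_in)

print("Start to first easing: %dh %02dm" % until_easing)
print("Start to author's first MFA sign-in: %dh %02dm" % until_sign_in)
```

In other words, roughly ten hours passed before the hotfix began to help, and it was closer to fourteen hours before I could complete an MFA sign-in.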
Overall, Monday wasn’t a great day for users and administrators alike. MFA-enabled accounts couldn’t access Office 365 applications if their refresh token expired and they needed to go through the MFA sign-in process to reauthenticate. Administrators, whose accounts are more likely to be protected by MFA, hit the same issue and lost access to Office 365 and Azure portals.
During the incident, Microsoft communicated with customers via the Office 365 Admin Center and the Azure status page, but didn’t always give the same story in both places. For instance, around 14:00, on the service health page of the Office 365 portal, we learned:
While we continue to develop the code update, we’re exploring additional workstreams to find a path to mitigation.
At the same time, the Azure portal told watchers that:
Engineers have explored mitigating a back-end service via deploying a code hotfix, and this is currently being validated in a staging environment to verify before potential roll-out to production. Engineers are also continuing to explore additional workstreams to expedite mitigation.
Obviously, the text posted on the Azure portal gave more complete information. One wonders why the people responsible for updating the portals couldn’t have used the same story?
On the one hand it’s reasonable that Azure should have its own communications because its services are used by more than Office 365. On the other, Microsoft runs both services and it is strange to have Office 365 give less information than is publicly available elsewhere.
Microsoft Says to Use MFA
Microsoft recommends that Office 365 tenants use MFA. The priority is to protect accounts with privileged access, like tenant administrators, followed by high-profile accounts, like those used by executives. However, for maximum protection against hacker attacks, all Office 365 user accounts should use MFA.
Microsoft reinforces the message by giving tenants that use MFA a big boost in their Secure Score (if that means anything to you). Generally speaking, I agree with Microsoft and think that all accounts should be protected. Until, that is, something bad happens and users can’t sign into Office 365 or any other Microsoft cloud service because of an MFA failure. It’s worth underlining that the problem only surfaces for new connections or when a user’s access token expires and needs to be renewed. While the access token is still valid, users can continue to connect even with a broken MFA service.
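The point about tokens can be sketched as a simple decision. This is a deliberately simplified model of the behavior described above, not a description of how Azure Active Directory is implemented; the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def sign_in_blocked(now, token_expires_at, mfa_available, new_connection=False):
    """Simplified model: return True if this sign-in attempt would be stuck."""
    # An MFA challenge is only needed for a new connection or an expired token.
    needs_mfa_challenge = new_connection or now >= token_expires_at
    # The user is blocked only if a challenge is needed and MFA cannot answer it.
    return needs_mfa_challenge and not mfa_available


now = datetime(2018, 11, 19, 9, 0, tzinfo=timezone.utc)
still_valid = now + timedelta(minutes=30)
already_expired = now - timedelta(minutes=5)

# Existing session with a valid token: keeps working during the outage.
print(sign_in_blocked(now, still_valid, mfa_available=False))
# Token has expired: the user must pass MFA and is stuck.
print(sign_in_blocked(now, already_expired, mfa_available=False))
# Brand new connection: also needs MFA, so also stuck.
print(sign_in_blocked(now, still_valid, mfa_available=False, new_connection=True))
```

This is why the impact of an MFA outage grows over time: the longer it lasts, the more tokens expire and the more users find themselves locked out.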
Disable MFA to Keep Working
The question then becomes what a tenant should do in case of an extended MFA outage when users need to get into Office 365 or other services and can’t because they cannot complete the MFA process. The obvious answer is to disable MFA for affected user accounts while the outage continues and then re-enable the accounts for MFA as soon as the outage is over and normal service resumes. Of course, this assumes that you can still sign into an administrator account to reset MFA for users. But keeping an admin account that isn’t secured with MFA is a bad idea, isn’t it?
Not if it’s a “breakglass” account. In other words, a privileged account that can be used in case of emergency when other administrator accounts are unavailable for some reason. See this article for a discussion on the topic as well as some advice from Microsoft on how to manage emergency administrative accounts for Azure Active Directory.
The lessons of the outage are clear. If they use MFA (as they should), Office 365 tenants need to be prepared to deal with outages. Knowing what accounts are protected with MFA is a start, being able to disable MFA if needed is a good idea (and revert once the problem eases), and having a breakglass account is also sensible.
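The steps above (know which accounts use MFA, disable it during an outage, and revert afterwards) can be sketched as a tiny in-memory model. Everything here is hypothetical: the `Account` class, the function names, and the example.com addresses are invented for illustration and do not correspond to any Microsoft API. A real tenant would perform these steps with Microsoft's own administrative tools.

```python
from dataclasses import dataclass


@dataclass
class Account:
    upn: str                  # user principal name (illustrative values only)
    mfa_enabled: bool
    breakglass: bool = False  # emergency admin account kept outside MFA


def snapshot_mfa(accounts):
    """Record which accounts currently have MFA enabled."""
    return {a.upn for a in accounts if a.mfa_enabled}


def disable_mfa(accounts, affected_upns):
    """Turn MFA off for the affected accounts while the outage lasts."""
    for a in accounts:
        if a.upn in affected_upns:
            a.mfa_enabled = False


def restore_mfa(accounts, snapshot):
    """Re-enable MFA for exactly the accounts recorded in the snapshot."""
    for a in accounts:
        if a.upn in snapshot:
            a.mfa_enabled = True


accounts = [
    Account("admin@example.com", mfa_enabled=True),
    Account("user1@example.com", mfa_enabled=True),
    Account("user2@example.com", mfa_enabled=False),
    Account("breakglass@example.com", mfa_enabled=False, breakglass=True),
]

snapshot = snapshot_mfa(accounts)      # 1. know what is protected
disable_mfa(accounts, snapshot)        # 2. outage: let users sign in
during_outage = sorted(a.upn for a in accounts if a.mfa_enabled)
restore_mfa(accounts, snapshot)        # 3. outage over: revert
after_outage = sorted(a.upn for a in accounts if a.mfa_enabled)

print("MFA-enabled during outage:", during_outage)
print("MFA-enabled after restore:", after_outage)
```

Taking the snapshot first is the important design choice: it means the revert restores MFA for the accounts that had it before the outage and leaves the others, including the breakglass account, as they were.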
The process to enable Office 365 accounts for multi-factor authentication is covered in Chapter 3 of the Office 365 for IT Pros eBook. We’re not so hot on disabling MFA…