Fixing the Achilles Heel of Office 365
Over the years, Office 365 has maintained a very high level of service, comfortably exceeding Microsoft’s financially-backed 99.9% SLA target since some initial glitches soon after Office 365 launched in 2011. When things have gone wrong recently, Azure AD has often been the source of the problem. Authentication woes of one sort or another have existed since the start of Office 365. As long ago as 2015, I asked whether Azure AD was the Achilles heel of Office 365.
In some respects, it’s natural that the directory service should be a hyper-critical component of any service, whether cloud or on-premises. If a client cannot authenticate, it cannot access resources. It’s a simple equation proven by the many instances when loss of authentication capability brought clients to a crashing halt.
The Backup Authentication Service and Its Three-Day Cache of Successful Authentications
Microsoft has gradually improved the resilience of Azure AD over the years by eliminating single points of failure (like the MFA service) and building out capacity. The latest innovation is a backup authentication service, aimed at underpinning 99.99% authentication uptime for Azure AD.
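To put the two uptime figures in perspective, here is a small worked calculation of the downtime each target tolerates. The 30-day month is my assumption for illustration, not a statement of how Microsoft measures its SLA.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Return the minutes of downtime permitted by an uptime target.

    Assumes a measurement period of `days` days (30 by default, an
    illustrative choice rather than Microsoft's SLA definition).
    """
    total_minutes = days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


if __name__ == "__main__":
    for target in (99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.2f} minutes of downtime")
```

Over 30 days, a 99.9% target tolerates roughly 43.2 minutes of downtime, while 99.99% leaves only about 4.3 minutes, which is why a second path for authentication matters.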
A November 22 blog post explains the plan. Microsoft has implemented a service that monitors Azure AD to detect any outages. When an outage occurs, the backup authentication service swings into operation to handle authentication requests from clients, which are routed to it automatically by the Azure AD gateway (the first point of contact). When the primary instance of Azure AD recovers, the backup service dynamically reroutes requests to it and returns to monitoring (normal) mode.
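The failover behavior described above can be pictured as a toy model. This is only a sketch of the idea, not Microsoft's implementation: the `Gateway` class, its `report_health` method, and the "primary"/"backup" labels are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Gateway:
    """Toy stand-in for a gateway that fronts two authentication services."""

    # Hypothetical flag set by whatever monitors the primary service.
    primary_healthy: bool = True

    def report_health(self, healthy: bool) -> None:
        # The monitor tells the gateway whether the primary is available.
        self.primary_healthy = healthy

    def route(self) -> str:
        # Requests go to the primary when it is healthy, otherwise to the backup.
        return "primary" if self.primary_healthy else "backup"


if __name__ == "__main__":
    gateway = Gateway()
    print(gateway.route())          # normal (monitoring) mode
    gateway.report_health(False)    # outage detected
    print(gateway.route())
    gateway.report_health(True)     # primary recovers
    print(gateway.route())
```

Because the routing decision sits in the gateway in this model, clients never need to know which service answered them.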
The backup service handles authentication requests using information derived from successful authentication requests processed by Azure AD (Figure 1). This information can be up to three days old. It’s enough for the backup service to validate that an application successfully authenticated at some point within the last three days and then generate an authentication response that allows the application to proceed. According to Microsoft, more than 90% of authentication requests processed by Azure AD are for existing client sessions. These can all be handled by the backup service. On the other hand, because the backup service doesn’t have any data for new sessions, it cannot handle these requests (or requests from guest accounts).
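A minimal sketch of the three-day rule might look like the following. The `BackupAuthCache` class, its method names, and the (user, app) key are my own assumptions for illustration; Microsoft has not published the data model the real service uses.

```python
from datetime import datetime, timedelta

# The article's three-day window for previously successful authentications.
MAX_AGE = timedelta(days=3)


class BackupAuthCache:
    """Hypothetical store of the last successful sign-in per (user, app)."""

    def __init__(self) -> None:
        self._last_success: dict[tuple[str, str], datetime] = {}

    def record_success(self, user: str, app: str, when: datetime) -> None:
        # Called while the primary service is healthy.
        self._last_success[(user, app)] = when

    def can_authenticate(
        self, user: str, app: str, now: datetime, is_guest: bool = False
    ) -> bool:
        # Guest accounts are not handled by the backup service.
        if is_guest:
            return False
        last = self._last_success.get((user, app))
        # No record means a new session, which the backup cannot serve.
        if last is None:
            return False
        # Only records from within the last three days are usable.
        return now - last <= MAX_AGE


if __name__ == "__main__":
    cache = BackupAuthCache()
    signed_in = datetime(2021, 11, 20, 9, 0)
    cache.record_success("alice", "outlook", signed_in)
    print(cache.can_authenticate("alice", "outlook", signed_in + timedelta(days=2)))
    print(cache.can_authenticate("bob", "outlook", signed_in + timedelta(days=2)))
```

In this model, a user who signed in two days ago keeps working during an outage, while a user with no recorded session has to wait for the primary service to return.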
Conditional Access and Resilience Defaults
Apart from dealing with authentication requests, the backup authentication service enforces multi-factor authentication, conditional access policies, and continuous access evaluation to ensure that invalid credentials cannot be used.
To ensure as high a level of continuity of service as possible, the backup authentication service uses resilience defaults for conditional access policies. These allow it to continue without depending on conditions such as sign-in risk or group membership that aren’t available in real time when the primary Azure AD service is offline. Essentially, the policies proceed on the basis that conditions have not changed since Azure AD went offline. Organizations that don’t want this to happen can disable the resilience defaults for all or some conditional access policies through the Azure AD admin center (Figure 2), in the knowledge that this will affect the ability of some client connections to authenticate.
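The trade-off can be expressed as a small decision function. This is an illustrative model only: the `Policy` class, its fields, and `evaluate_during_outage` are hypothetical names and do not correspond to the Azure AD policy schema or API.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    """Hypothetical, simplified view of a conditional access policy."""

    name: str
    # True if the policy depends on data such as sign-in risk or group
    # membership that cannot be refreshed while the primary is offline.
    needs_realtime_data: bool
    resilience_defaults_enabled: bool = True


def evaluate_during_outage(policy: Policy, last_known_result: bool) -> bool:
    """Decide whether a sign-in passes this policy during an outage."""
    if policy.needs_realtime_data and not policy.resilience_defaults_enabled:
        # Without resilience defaults the condition cannot be evaluated,
        # so the sign-in is refused until the primary service returns.
        return False
    # Otherwise assume conditions have not changed since the outage began.
    return last_known_result


if __name__ == "__main__":
    risk_policy = Policy("Block risky sign-ins", needs_realtime_data=True)
    print(evaluate_during_outage(risk_policy, last_known_result=True))
    risk_policy.resilience_defaults_enabled = False
    print(evaluate_during_outage(risk_policy, last_known_result=True))
```

Disabling the defaults in this model trades availability for certainty: the policy never acts on stale conditions, but affected users cannot sign in for the duration of the outage.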
Gradual Introduction Since 2019
The best thing of all is that the backup authentication service has been in place for OWA and SharePoint Online since 2019. In early 2021, Microsoft added support for “native” apps, including the Microsoft 365 apps for enterprise (like Outlook) and the Teams desktop and mobile clients. Because the rerouting of authentication requests happens at the gateway level and the responses from the backup authentication service are identical to those issued by the primary Azure AD service, no client reconfiguration or special settings are necessary.
Microsoft is now extending support to apps that use OpenID Connect, starting with its own Teams Online and Office Online apps and progressing to customer apps. The company expects to begin rolling out this support at the end of 2021. At that point, all of Office 365 should be protected by the backup authentication service, meaning that any future Azure AD outage is likely to be much less serious than previous events.
Storing information about successful authentication requests for three days and using that information to allow people to continue working if the Azure AD service goes offline seems like a reasonable balance between utility and security. We know that Azure AD outages have occurred in the past. We also know that Azure AD continues to handle more traffic over time (now at 500 million monthly active users generating tens of billions of authentications daily, according to Microsoft), which creates stress on its own. Some Azure AD outages can be expected in the future. The good news is that a backup supply for authentications, at least for the majority of client sessions, is now available.