Analyzing the Teams Outage of 18 February 2019

Teams Hits the Buffers

Everything was progressing normally until lunchtime (UTC) on Monday, when my Teams desktop client decided that it didn’t want to connect to the Teams back-end services any longer. The client dutifully reported “D’oh Something went wrong…” and issued a totally unhelpful 500 error code because it couldn’t connect to https://teams.microsoft.com (or even just https:// at times). The problem turned out to be the first major worldwide outage for Teams.

Error 500 as the Teams client can't connect to its services

During the incident the Teams mobile client continued working. This is probably because the desktop and browser clients authenticate once an hour to refresh their tokens while the mobile client uses a different mechanism. The desktop client is built with Electron and is essentially a wrapper around the browser client, hence the common behavior. In any case, once the time came for the client to reauthenticate itself, it failed and “D’oh” appeared. No amount of signing out and back in again helped because the problem existed in the Teams back-end services and the client could not obtain the necessary token.
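To make the hourly refresh behavior concrete, here is a minimal Python sketch. It is not how the Teams client is actually implemented: the one-hour lifetime is taken from the description above, and the names `Token`, `AuthServiceDown`, `refresh_token`, and `client_status` are hypothetical. It only illustrates why a client with a cached token keeps working for a while after the back end breaks and then fails at its next refresh.

```python
from dataclasses import dataclass

# Assumption: tokens last one hour, per the "once an hour" refresh described above.
TOKEN_LIFETIME_SECONDS = 3600


@dataclass
class Token:
    issued_at: float  # seconds on an arbitrary clock

    def is_valid(self, now: float) -> bool:
        return now - self.issued_at < TOKEN_LIFETIME_SECONDS


class AuthServiceDown(Exception):
    """Raised by the hypothetical auth back end while it is unavailable."""


def refresh_token(now: float, backend_healthy: bool) -> Token:
    # Stand-in for the real token request; fails while the back end is unhealthy.
    if not backend_healthy:
        raise AuthServiceDown("500: could not obtain a token")
    return Token(issued_at=now)


def client_status(token: Token, now: float, backend_healthy: bool) -> str:
    """Return what the user would see at time `now`."""
    if token.is_valid(now):
        return "connected"  # cached token still good, no refresh needed
    try:
        refresh_token(now, backend_healthy)
        return "connected"
    except AuthServiceDown:
        return "error 500"


token = Token(issued_at=0)
half_hour_in = client_status(token, now=1800, backend_healthy=False)
at_expiry = client_status(token, now=3600, backend_healthy=False)
```

Under these assumptions, `half_hour_in` is "connected" and `at_expiry` is "error 500", which would explain why different users hit the problem at different times.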

No Joy Found in Teams Logs

Examining the Teams logs (click the Teams logo in the system tray and select Get Logs from the menu) didn’t shed any light on the problem. Here’s an example:

If you search the internet, the advice for dealing with a 500 error is often to remove all Teams credentials from the Windows Credential Manager. That can help if you have some local corruption, but it has absolutely no effect when a back-end service is bust.

Understanding the Status of Incident TM173756

The Microsoft 365 Status Twitter account informed the world that incident TM173756 was progressing. Further information was available in the Office 365 Admin Center. According to the Admin Center, the incident began at 8:23 UTC. I wasn’t affected until around 13:00, which coincided with a spike in problems reported to DownDetector.com (perhaps when the throttling referred to below happened). Microsoft’s summary given when the incident finished at 18:00 UTC was:

Final status: Further investigation determined that rerouting traffic to alternate healthy infrastructure didn’t have the desired effect. Engineers implemented a configuration change to improve efficiency of Teams authentication components to completely remediate impact.

Preliminary root cause: A transient error, or currently unidentified update caused Teams front-end services to encounter errors when attempting to fetch password store keys. The errors resulted in many retries to the service that contains the key values, and eventually the service throttled attempts to fetch further keys in order to prevent further impact to other services. Engineers updated the configuration within the authentication service to mitigate password store key retrieval issues.
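The retry-then-throttle pattern in Microsoft’s summary can be sketched in a few lines of Python. This is a toy model, not the real service: `ThrottledKeyService`, `fetch_with_retries`, and the per-window budget of three calls are all hypothetical. It shows how immediate retries against a failing dependency can use up a caller’s budget, so that requests stay blocked even after the original fault clears.

```python
from typing import List


class ThrottledKeyService:
    """Hypothetical key service that rejects calls beyond a per-window budget."""

    def __init__(self, budget_per_window: int):
        self.budget = budget_per_window
        self.calls_in_window = 0

    def fetch_key(self, transient_fault: bool) -> str:
        self.calls_in_window += 1
        if self.calls_in_window > self.budget:
            return "throttled"  # protect other services from this caller
        return "error" if transient_fault else "ok"


def fetch_with_retries(service: ThrottledKeyService,
                       max_attempts: int,
                       transient_fault: bool) -> List[str]:
    """Retry immediately until success or attempts run out; record each outcome."""
    outcomes = []
    for _ in range(max_attempts):
        result = service.fetch_key(transient_fault)
        outcomes.append(result)
        if result == "ok":
            break
    return outcomes


service = ThrottledKeyService(budget_per_window=3)
during_fault = fetch_with_retries(service, max_attempts=5, transient_fault=True)
after_fault = fetch_with_retries(service, max_attempts=1, transient_fault=False)
```

In this model `during_fault` is three errors followed by two throttled calls, and `after_fault` is still throttled because the window has not reset. Real services usually combine throttling with client-side backoff, which this sketch deliberately leaves out.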

Not that regular users knew about these sources of information or were able to find out what was happening. All they knew was that they couldn’t get into Teams. Because Teams is an online service with no offline capability (unlike Outlook, for instance), user productivity within Teams fell dramatically. On the upside, resources connected to Teams like Planner and SharePoint continued to be available and accessible to users.

Because this was the first major worldwide outage for Teams, we hadn’t seen the effect of a major problem for Teams before. With over 420,000 organizations now using Teams, the potential impact on customers was obvious.

The Post-Incident Report

Within 48 hours of a serious incident, Microsoft issues a Post-Incident Report (PIR) to explain what happened and the actions they propose to take to avoid similar situations in the future. The preliminary version of the PIR is now available to Office 365 tenants affected by the outage through the Service Health (History) section of the Office 365 Admin Center (or download using the link below). The findings of the PIR might change over time as more information becomes available to Microsoft.

Where to find the Post Incident Report for the Teams outage TM173756

Although the “underlying catalyst of the issue is still under investigation” (Microsoft-speak to say that they still don’t know exactly what caused the problem), the PIR gives some insight into the problem and how Microsoft worked to restore service.

Analyzing the Outage

Stepping through the PIR, we find the following:

  • The first report to indicate a problem appeared in telemetry at 8:23 UTC.
  • Microsoft seemed to regard the telemetry as showing inconsistencies rather than a real issue until 13:29 UTC, when load spiked, possibly due to load coming from U.S.-based tenants at the start of their working day. You can see the spike in the DownDetector.com graph.
  • 28 minutes later, Microsoft made the incident a high-priority investigation and started to analyze the telemetry. The delay is possibly due to waiting to see if the underlying cause of the spike rectified itself as well as the time needed to understand exactly what was going on.
  • Pretty quickly, engineers figured out that the problem was confined to the browser and desktop clients. However, it then took a further hour before they reviewed recent changes and decided to roll back a change made on February 15 (15:24 UTC).
  • The rollback had no effect. At 15:51 UTC, attention focused on Azure Key Vault, one of the services Teams depends on. Given that users had issues signing into Teams, the problem was always likely to lie along the authentication route. Some 30 minutes later, engineers found that “service automation” had throttled access to Key Vault to stop multiple retries by Teams clients from affecting other services that depend on Key Vault.
  • At 16:30 UTC, a failover to “alternative authentication components” began (we’re not told what these components are). The failover completed 14 minutes later, after which service health began to improve for U.S.-based customers. European customers took longer (my service was restored at 18:00 UTC).
  • Some problems noted after the failover were fixed by a configuration change. The incident finished at 18:00 UTC.
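The intervals in this timeline are easy to check with a few lines of Python. The timestamps are the ones quoted from the PIR above; the helper names `at` and `minutes_between` are just for illustration.

```python
from datetime import datetime, timedelta

DAY = "2019-02-18"  # date of the incident, all times UTC


def at(hhmm: str) -> datetime:
    """Build a datetime on the incident day from an HH:MM string."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


def minutes_between(start: datetime, end: datetime) -> int:
    return int((end - start).total_seconds() // 60)


first_signal = at("08:23")                               # first telemetry report
alert = at("13:29")                                      # load spike and automated alert
high_priority = alert + timedelta(minutes=28)            # incident escalated
key_vault_focus = at("15:51")                            # attention turns to Key Vault
failover_start = at("16:30")
failover_done = failover_start + timedelta(minutes=14)   # U.S. service improves
incident_end = at("18:00")

delay_before_escalation = minutes_between(first_signal, high_priority)
escalation_to_end = minutes_between(high_priority, incident_end)
total_duration = minutes_between(first_signal, incident_end)
```

This gives an escalation time of 13:57 UTC, 334 minutes (a little over five and a half hours) after the first signal, then 243 minutes from escalation to the end of the incident and 577 minutes overall.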

The PIR notes that Microsoft has made a fix to stop the same issue happening again.

Overall, some criticism might be made of the five-and-a-half hour delay between the first observation that an issue might exist and the time when Microsoft engineering swung into high-priority action. However, the nature of cloud services is that they generate a ton of telemetry and not every signal means that a problem exists. The PIR notes that “internal automation” triggered an alert when a threshold was reached (13:29 UTC), which corresponds to when user load increased.

As you’d expect, Office 365 administration is highly automated to help humans decide when they need to intervene, which is exactly what happened here. Once the problem was declared an incident, things progressed reasonably quickly given the scale of the impact on users. What’s interesting is that this was a worldwide outage affecting users in multiple Office 365 datacenter regions. This points to a single point of failure (like the MFA service outage in November 2018). It would be good if Microsoft addressed such weaknesses as they investigate and remedy errors in the service.

Slack has Outages Too

Slack is the major competitor for Teams. To be fair to Teams, Slack has its own problems and outages. It just goes to show that at times cloud services will experience issues. The question is less about how issues occur and more about how quickly service providers recover and how well they communicate with customers. In this instance, Teams recovered reasonably quickly, but Microsoft still has work to do when it comes to communications.


Chapter 13 of the Office 365 for IT Pros eBook is the best place to learn about Teams. You might also want to delve into Chapter 4, because that’s where we cover things like the Office 365 Admin Center. And Chapter 2 is where we talk about PIRs. We have lots of stuff that’s relevant to this discussion.
