Everything was progressing normally until lunchtime (UTC) on Monday, when my Teams desktop client decided that it no longer wanted to connect to the Teams back-end services. The client dutifully reported “D’oh Something went wrong…” and issued a totally unhelpful 500 error code because it couldn’t connect to https://teams.microsoft.com (or even just https:// at times). The problem turned out to be the first major worldwide outage for Teams.
Error 500 as the Teams client can’t connect to its services
During the incident, the Teams mobile client continued working. This is probably because the desktop and browser clients authenticate once an hour to refresh their tokens, while the mobile client uses a different mechanism. The desktop client is built with Electron and is essentially a wrapper around the browser client, hence the common behavior. In any case, once the time came for the client to reauthenticate itself, it failed and “D’oh” appeared. No amount of signing out and back in again helped because the problem existed in the Teams back-end services and the client could not obtain the necessary token.
No Joy Found in Teams Logs
Examining the Teams logs (click the Teams logo in the system tray and select Get Logs from the menu) didn’t shed any light on the problem. Here’s an example:
Scenario.Name: desktop_win_sso, Scenario.Step: stop, Scenario.Status: success,
Mon Feb 18 2019 15:35:47 GMT+0000 (GMT Standard Time) <23544> -- event -- name: desktop_page_bad_response, responseCode: 500, errorUrl: https://,
Mon Feb 18 2019 15:35:47 GMT+0000 (GMT Standard Time) <23544> -- info -- Handled get response details with http response code 500: https://
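If you have a large log file to wade through, a few lines of script can at least confirm how often the client received an error response. The following is a minimal Python sketch that assumes the log lines look like the sample above; the function name and the regular expression are my own illustration and not part of any Teams tooling.

```python
import re
from collections import Counter

# Matches the two ways the sample log lines report an HTTP status:
# "responseCode: 500" and "http response code 500".
RESPONSE_CODE = re.compile(r"(?:responseCode:|http response code)\s*(\d{3})\b")

def count_response_codes(lines):
    """Return a Counter of HTTP response codes found in the given log lines."""
    codes = Counter()
    for line in lines:
        match = RESPONSE_CODE.search(line)
        if match:
            codes[int(match.group(1))] += 1
    return codes

# The three sample lines from the log extract above.
sample = [
    "Scenario.Name: desktop_win_sso, Scenario.Step: stop, Scenario.Status: success,",
    "Mon Feb 18 2019 15:35:47 GMT+0000 (GMT Standard Time) <23544> -- event -- "
    "name: desktop_page_bad_response, responseCode: 500, errorUrl: https://,",
    "Mon Feb 18 2019 15:35:47 GMT+0000 (GMT Standard Time) <23544> -- info -- "
    "Handled get response details with http response code 500: https://",
]

codes = count_response_codes(sample)
print(codes)
```

A count like this only tells you that the client saw 500 responses. It says nothing about why the service returned them, which is exactly the limitation of the logs in this incident.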
If you search the internet, the advice for dealing with a 500 error is often to remove all Teams credentials from the Windows Credential Manager. That can help if you have some local corruption, but it has absolutely no effect when a back-end service is bust.
Understanding the Status of Incident TM173756
The Microsoft 365 Status Twitter account informed the world that incident TM173756 was progressing. Further information was available in the Office 365 Admin Center. According to the Admin Center, the incident began at 8:23 UTC. I wasn’t affected until around 13:00, as part of a spike in problems reported to DownDetector.com (perhaps when the throttling referred to below happened). Microsoft’s summary given when the incident finished at 18:00 UTC was:
Final status: Further investigation determined that rerouting traffic to alternate healthy infrastructure didn’t have the desired effect. Engineers implemented a configuration change to improve efficiency of Teams authentication components to completely remediate impact.
Preliminary root cause: A transient error, or currently unidentified update caused Teams front-end services to encounter errors when attempting to fetch password store keys. The errors resulted in many retries to the service that contains the key values, and eventually the service throttled attempts to fetch further keys in order to prevent further impact to other services. Engineers updated the configuration within the authentication service to mitigate password store key retrieval issues.
Not that regular users knew about these sources of information or were able to find out what was happening. All they knew was that they couldn’t get into Teams. Because Teams is an online service with no offline capability (unlike Outlook, for instance), user productivity within Teams fell dramatically. On the upside, resources connected to Teams like Planner and SharePoint continued to be available and accessible to users.
Because this was the first major worldwide outage for Teams, we hadn’t seen the effect of a major problem on the service before. With over 420,000 organizations now using Teams, the potential impact on customers was obvious.
The Post-Incident Report
Within 48 hours of a serious incident, Microsoft issues a Post-Incident Report (PIR) to explain what happened and the actions they propose to take to avoid similar situations in the future. The preliminary version of the PIR is now available to Office 365 tenants affected by the outage through the Service Health (History) section of the Office 365 Admin Center (or download using the link below). The findings of the PIR might change over time as more information becomes available to Microsoft.
Although the “underlying catalyst of the issue is still under investigation” (Microsoft-speak to say that they still don’t know exactly what caused the problem), the PIR gives some insight into the problem and how Microsoft worked to restore service.
Analyzing the Outage
Stepping through the PIR, we find the following:
The first report to indicate a problem appeared in telemetry at 8:23 UTC.
Microsoft seemed to regard the telemetry as showing inconsistencies rather than a real issue until 13:29 UTC when load spiked, possibly due to load coming from U.S.-based tenants at the start of their working day. You can see the spike in the DownDetector.com graph.
28 minutes later, Microsoft made the incident a high-priority investigation and started to analyze the telemetry. The delay is possibly due to waiting to see if the underlying cause of the spike rectified itself as well as the time needed to understand exactly what was going on.
Pretty quickly, engineers figured out that the problem was confined to the browser and desktop clients. However, it then took a further hour before they reviewed recent changes and decided to roll back a change made on February 15 (15:24 UTC).
The rollback had no effect. At 15:51 UTC, attention focused on Azure Key Vault, one of the services Teams depends on. Given that users had issues signing into Teams, the problem was always likely to lie along the authentication route. Some 30 minutes later, engineers found that “service automation” had throttled access to Key Vault to stop multiple retries by Teams clients from affecting other services that depend on Key Vault.
At 16:30 UTC, a failover to “alternative authentication components” began (we’re not told what these components are). After the failover completed 14 minutes later, service health began to improve for U.S.-based customers. European customers took longer (my service was restored at 18:00 UTC).
Some problems noted after the failover were fixed by a configuration change. The incident finished at 18:00 UTC.
The PIR notes that Microsoft has made a fix to stop the same issue happening again.
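The retry storm described in the timeline (clients retrying until “service automation” throttled access to Key Vault) is the classic case that exponential backoff with jitter is meant to soften. The PIR doesn’t say how Teams retries or what Microsoft changed, so the following Python sketch only illustrates the general pattern. `ThrottledError`, `fetch_with_backoff`, and the default delays are hypothetical names and values chosen for the example.

```python
import random
import time

class ThrottledError(Exception):
    """Hypothetical error raised when a dependency refuses a request."""

def backoff_ceilings(max_retries, base=0.5, cap=30.0):
    """Upper bound (seconds) for the wait before each retry, doubling up to cap."""
    return [min(cap, base * 2 ** n) for n in range(max_retries)]

def fetch_with_backoff(fetch, max_retries=5, base=0.5, cap=30.0,
                       sleep=time.sleep, rng=random.random):
    """Call fetch(); on ThrottledError wait a jittered, growing delay and retry."""
    for ceiling in backoff_ceilings(max_retries, base, cap):
        try:
            return fetch()
        except ThrottledError:
            # Full jitter: wait a random fraction of the ceiling so that many
            # clients do not all retry at the same moment.
            sleep(rng() * ceiling)
    # Final attempt after the last wait; any error now propagates to the caller.
    return fetch()

print(backoff_ceilings(8))
```

The point of the growing, randomized wait is that a struggling service sees fewer and more spread-out requests the longer it struggles, instead of a wall of immediate retries. The `sleep` and `rng` parameters exist only so the behavior can be tested without real delays.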
Overall, some criticism might be made of the five-and-a-half hour delay between the first observation that an issue might exist and the time when Microsoft engineering swung into high-priority action. However, the nature of cloud services is that they generate a ton of telemetry and not every signal means that a problem exists. The PIR notes that “internal automation” triggered an alert when a threshold was reached (13:29 UTC), which corresponds to when user load increased.
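To check the arithmetic behind the delay, here is a small Python sketch that works only from the times quoted in the timeline above. The variable names are mine, and all times are treated as UTC on 18 February 2019.

```python
from datetime import datetime, timedelta

DAY = "2019-02-18"  # date of the incident; all times below are UTC

def utc(hhmm):
    """Build a datetime on the incident day from an HH:MM string."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")

def hours_minutes(delta):
    """Format a timedelta as 'Xh YYm'."""
    total_minutes = int(delta.total_seconds()) // 60
    return f"{total_minutes // 60}h {total_minutes % 60:02d}m"

first_signal = utc("08:23")                      # first telemetry report
alert = utc("13:29")                             # load spike and automated alert
high_priority = alert + timedelta(minutes=28)    # incident made high priority
resolved = utc("18:00")                          # incident finished

print(hours_minutes(high_priority - first_signal))  # 5h 34m
print(hours_minutes(resolved - high_priority))      # 4h 03m
print(hours_minutes(resolved - first_signal))       # 9h 37m
```

In other words, more than half of the elapsed time between the first signal and the end of the incident passed before the high-priority investigation started.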
As you’d expect, Office 365 administration is highly automated to help humans decide when they need to intervene, which is exactly what happened here. Once the problem was declared an incident, things progressed reasonably quickly given the scale of the impact on users. What’s interesting is that this was a world-wide outage affecting users in multiple Office 365 datacenter regions. This points to a single point of failure (like the MFA service outage in November 2018). It would be good if Microsoft addressed these weaknesses too as they investigate and remedy errors in the service.
Slack has Outages Too
Slack is the major competitor for Teams. To be fair to Teams, Slack has its own problems and outages. It just goes to show that at times cloud services will experience issues. The question is less about how issues occur and more about how quickly service providers recover and how well they communicate with customers. In this instance, Teams recovered reasonably quickly, but Microsoft still has work to do when it comes to communications.
Chapter 13 of the Office 365 for IT Pros eBook is the best place to learn about Teams. You might also want to delve into Chapter 4, because that’s where we cover things like the Office 365 Admin Center. And Chapter 2 is where we talk about PIRs. We have lots of stuff that’s relevant to this discussion.