Updated November 22 with Microsoft preliminary root cause analysis and then on November 26 with the final version.
Two Outages in Ten Weeks
The second major outage for Azure multi-factor authentication (MFA) in two months brought some Office 365 users to a halt on Monday, November 19. That is, until administrators understood what was happening and disabled MFA for accounts to allow users to sign in.
The first outage occurred on September 4 when lightning struck Microsoft’s San Antonio datacenter. Postmortems published after the event (here’s the VSTS version) revealed how the impact of the outage rippled across multiple Microsoft cloud services, including MFA.
The Problem
The original problem statement was:
“Customers in Europe, Asia-Pacific and the Americas regions may experience difficulties signing into Azure resources, such as Azure Active Directory, when Multi-Factor Authentication is required by policy.”
“A recent update was deployed to improve connections to caching services for the MFA service, this introduced a race condition which prevented users from being able to sign-in, or carry-out self-service password resets, when using MFA services…
Engineers initially rolled back the deployment which eliminated the connection between the Azure MFA service and the backend caching service. Engineers subsequently cycled impacted servers which allowed authentication requests to succeed.”
On November 26, Microsoft updated the root cause analysis to say:
“There were three independent root causes discovered. In addition, gaps in telemetry and monitoring for the MFA services delayed the identification and understanding of these root causes which caused an extended mitigation time.
The first two root causes were identified as issues on the MFA frontend server, both introduced in a roll-out of a code update that began in some datacenters (DCs) on Tuesday, 13 November 2018 and completed in all DCs by Friday, 16 November 2018. The issues were later determined to be activated once a certain traffic threshold was exceeded which occurred for the first time early Monday (UTC) in the Azure West Europe (EU) DCs. Morning peak traffic characteristics in the West EU DCs were the first to cross the threshold that triggered the bug. The third root cause was not introduced in this rollout and was found as part of the investigation into this event.
1. The first root cause manifested as latency issue in the MFA frontend’s communication to its cache services. This issue began under high load once a certain traffic threshold was reached. Once the MFA services experienced this first issue, they became more likely to trigger second root cause.
2. The second root cause is a race condition in processing responses from the MFA backend server that led to recycles of the MFA frontend server processes which can trigger additional latency and the third root cause (below) on the MFA backend.
3. The third identified root cause, was previously undetected issue in the backend MFA server that was triggered by the second root cause. This issue causes accumulation of processes on the MFA backend leading to resource exhaustion on the backend at which point it was unable to process any further requests from the MFA frontend while otherwise appearing healthy in our monitoring.”
The worrying part of the story is that a code update proved to be unreliable when introduced into production, which does not reflect well on Microsoft’s cloud quality and testing regimes.
Early-morning Incident
The incident started at 04:39 UTC and stopped users from completing the MFA secondary challenge and signing into services like Office 365. For instance, the text message containing the code to prove that the account owner has the device registered for the account never arrives, meaning that the challenge shown below can never be completed.
No chance of completion because the SMS never arrives.
According to Microsoft, the problem started to ease at around 14:45 UTC after a hotfix was deployed. It takes a long time to deploy code fixes across a massive infrastructure and many tenants were affected by the problem for several hours afterwards. I first managed to authenticate with MFA at 18:23 UTC. Others were not so lucky and the lack of connectivity persisted for several hours afterwards. The incident slowly wound down and, at the time of writing, the situation is being monitored by Microsoft but everything is working.
Overall, Monday wasn’t a great day for users and administrators alike. MFA-enabled accounts couldn’t access Office 365 applications if their refresh token expired and they needed to go through the MFA sign-in process to reauthenticate. Administrators, whose accounts are more likely to be protected by MFA, hit the same issue and lost access to Office 365 and Azure portals.
Communication Blues
During the incident, Microsoft communicated with customers via the Office 365 Admin Center and the Azure status page, but didn’t always give the same story in both places. For instance, around 14:00, on the service health page of the Office 365 portal, we learned:
While we continue to develop the code update, we’re exploring additional workstreams to find a path to mitigation.
While at the same time, the Azure portal told watchers that:
Engineers have explored mitigating a back-end service via deploying a code hotfix, and this is currently being validated in a staging environment to verify before potential roll-out to production. Engineers are also continuing to explore additional workstreams to expedite mitigation.
Obviously, the text posted on the Azure portal gave more complete information. One wonders why the people responsible for updating the portals couldn’t have used the same story?
What Azure says
On the one hand it’s reasonable that Azure should have its own communications because its services are used by more than Office 365. On the other, Microsoft runs both services and it is strange to have Office 365 give less information than is publicly available elsewhere.
Microsoft Says to Use MFA
Microsoft recommends that Office 365 tenants use MFA. The priority is to protect accounts with privileged access, like tenant administrators, followed by high-profile accounts, like those used by executives. However, for maximum protection against hacker attacks, all Office 365 user accounts should use MFA.
Microsoft reinforces the message by giving tenants that use MFA a big boost in their Secure Score (if that means anything to you). Generally speaking, I agree with Microsoft and think that all accounts should be protected. Until, that is, something bad happens and users can’t sign into Office 365 or any other Microsoft cloud service because of an MFA failure. It’s worth underlining that the problem only surfaces for new connections or when a user’s access token expires and needs to be renewed. While the access token is still valid, users can continue to connect even with a broken MFA service.
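The point about token validity can be made concrete with a small sketch. The Python below is a simplified model and not how Azure Active Directory evaluates sessions: the `Session` type, the `can_keep_working` function, the user names, and the expiry times are all hypothetical, and it assumes a single expiry time per user. It only illustrates that users with a still-valid token carry on, while new connections and expired tokens run into the broken MFA service.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Session:
    user: str
    # None models a brand-new connection that has no token yet (assumption).
    token_expires_at: Optional[datetime]


def can_keep_working(session: Session, now: datetime) -> bool:
    """Return True if the user can continue without an MFA challenge at `now`."""
    if session.token_expires_at is None:
        return False  # new connection: must complete MFA, which is down
    return now < session.token_expires_at


# A moment during the outage (the incident began at 04:39 UTC on November 19).
now = datetime(2018, 11, 19, 9, 0, tzinfo=timezone.utc)

# Hypothetical users and expiry times, for illustration only.
sessions = [
    Session("alice", datetime(2018, 11, 19, 12, 0, tzinfo=timezone.utc)),
    Session("bob", datetime(2018, 11, 19, 8, 30, tzinfo=timezone.utc)),
    Session("carol", None),
]

blocked = [s.user for s in sessions if not can_keep_working(s, now)]
print(blocked)  # ['bob', 'carol']
```

In this model, alice is unaffected until her token runs out at noon, which is why an outage like this one hits users at different times.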
Disable MFA to Keep Working
The question then becomes what a tenant should do in case of an extended MFA outage when users need to get into Office 365 or other services and can’t because they cannot complete the MFA process. The obvious answer is to disable MFA for affected user accounts while the outage continues and then re-enable the accounts for MFA immediately after the outage is over and normal service resumes. Of course, this assumes that you can still sign into an administrator account to reset MFA for users. But keeping an admin account that isn’t secured with MFA is a bad idea, isn’t it?
Breakglass Accounts
Not if it’s a “breakglass” account. In other words, a privileged account that can be used in case of emergency when other administrator accounts are unavailable for some reason. See this article for a discussion on the topic as well as some advice from Microsoft on how to manage emergency administrative accounts for Azure Active Directory.
The lessons of the outage are clear. If they use MFA (as they should), Office 365 tenants need to be prepared to deal with outages. Knowing what accounts are protected with MFA is a start, being able to disable MFA if needed is a good idea (and revert once the problem eases), and having a breakglass account is also sensible.
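One way to prepare for the disable-and-revert step is to record which accounts had MFA switched off for the emergency, so that exactly those accounts are re-enabled afterwards. The sketch below models that bookkeeping in Python with an in-memory dictionary. It makes no calls to Office 365 or Azure Active Directory, and the function names `disable_mfa_for_outage` and `restore_mfa` and the sample accounts are hypothetical; in a real tenant the state changes would be made with whatever administrative tooling you use.

```python
def disable_mfa_for_outage(mfa_state, affected_users):
    """Switch MFA off for affected users and return the accounts actually changed."""
    changed = []
    for user in affected_users:
        if mfa_state.get(user):  # only touch accounts that currently have MFA on
            mfa_state[user] = False
            changed.append(user)
    return changed


def restore_mfa(mfa_state, changed):
    """Re-enable MFA for exactly the accounts that were disabled during the outage."""
    for user in changed:
        mfa_state[user] = True


# Hypothetical tenant snapshot: True means the account is enabled for MFA.
mfa_state = {
    "admin@contoso.com": True,
    "ceo@contoso.com": True,
    "kiosk@contoso.com": False,
}

# During the outage: unblock the users who cannot sign in.
changed = disable_mfa_for_outage(mfa_state, ["ceo@contoso.com", "kiosk@contoso.com"])
print(changed)  # ['ceo@contoso.com']

# After the outage: put MFA back only where it was removed.
restore_mfa(mfa_state, changed)
print(mfa_state["ceo@contoso.com"], mfa_state["kiosk@contoso.com"])  # True False
```

Keeping the list of changed accounts means that an account which never had MFA (the kiosk account here) is not accidentally enabled during the revert, and no account is left unprotected once service resumes.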
The process to enable Office 365 accounts for multi-factor authentication is covered in Chapter 3 of the Office 365 for IT Pros eBook. We’re not so hot on disabling MFA…