Downdetector.com reports of Office 365 connectivity problems
Unhappy Run of Events
Azure Active Directory has been having an unhappy run of outages recently, notably in September 2018 when a lightning strike in Texas caused issues for many services and in November 2018 when a problem with the Multi-Factor Authentication service made it unavailable in multiple Office 365 regions. Another authentication outage flowing from Azure Active Directory problems happened on January 29. It’s an unfortunate run of problems that underlines the truth that if you can’t authenticate, you can’t connect, and you can’t work.
Things don’t seem to have changed much since 2015, but in fact had you asked the question in August 2018 when calm had existed for several months, you might have received a different answer. Clouds change quickly and can turn on you…
What Happened in an Incident
Following the January 24-25 problems, Microsoft issued a Post Incident Report (PIR) to explain what happened in two separate but conjoined incidents (EX172564 and EX172491). The first is about a failure in capacity that affected approximately 1% of the users served by the Office 365 EMEA datacenter region, who could not connect to their mailboxes. The other fault affected approximately 10% of the users, but not as seriously.
The figures are from the PIR. After years of monitoring Office 365, Microsoft’s telemetry is well developed, and I am inclined to accept their data. Based on the volume of outage reports that flowed in, you might have thought that much more than 1% of users were affected. This reflects the natural inclination of people who are affected to protest while the majority who aren’t affected stay silent (they’re working).
According to the PIR, the root cause was a Windows Server component that handles User Datagram Protocol (UDP) transactions. The component caused a kernel lock to be held for an extended period, which resulted in domain controllers crashing. The load then shifted to the remaining domain controllers, and the reduced pool of available controllers couldn’t handle it.
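To see why losing a few controllers can take down a whole pool, here is a minimal Python sketch of the overload effect. It is not Microsoft’s model and nothing in it comes from the PIR: the function name, the even spread of load, the "one overloaded controller drops out per round" rule, and all the numbers are assumptions made purely for illustration.

```python
def surviving_controllers(total_load, capacity_per_dc, healthy_dcs):
    """Return how many controllers remain once overload failures stop cascading.

    Assumptions (hypothetical, not from the PIR): load is spread evenly over
    the healthy controllers, and whenever the per-controller share exceeds
    capacity, one more controller drops out and the check repeats.
    """
    while healthy_dcs > 0:
        share = total_load / healthy_dcs
        if share <= capacity_per_dc:
            return healthy_dcs  # the pool is stable at this size
        healthy_dcs -= 1  # an overloaded controller drops out
    return 0


# Hypothetical figures: each controller can serve 120 requests/s and the
# pool faces a steady 1,000 requests/s.
print(surviving_controllers(1000, 120, 10))  # 100 per controller: stable
print(surviving_controllers(1000, 120, 9))   # about 111 per controller: stable
print(surviving_controllers(1000, 120, 8))   # 125 per controller: collapses
```

With these made-up numbers the pool tolerates one crash but not two. Once the per-controller share passes capacity, every further failure makes the share worse, so the pool never finds a stable size.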
Future Changes
All systems can experience problems if available capacity is reduced below the level of user demand. The PIR says that Microsoft is conducting an architectural review to understand if they need to deploy extra scalability and resiliency options. They’re also looking at the way the automated recovery worked inside Office 365 when a situation like this happens so the processes work better in the future.
I guess what happened is a unique condition that Microsoft had not designed for. What’s bad about this situation is that the weakness of Azure Active Directory to handle spikes in load caused when capacity drops continues to be a concern. Given the essential nature of Azure Active Directory to the Office 365 ecosystem, it seems like Microsoft could do more to manage spikes when things go wrong.
What Went Right
On the upside, the segmentation of resources inside Office 365 limited the effect of the problem. Instead of all European users being affected, only those users whose accounts were in the forests served by the failed domain controllers had a problem. If your account was in another forest (like mine), you didn’t have a problem. This is an example of how not putting all your eggs in the proverbial basket really is a good idea.
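The effect of segmentation on the "blast radius" can be put into a few lines of Python. This is only a sketch: the forest names, the number of forests, and the user counts are hypothetical and chosen so that the result lands at 1%. The PIR does not say how many forests serve the region or how users are spread across them.

```python
def affected_fraction(users_per_forest, failed_forests):
    """Fraction of all users whose accounts sit in a failed forest."""
    total = sum(users_per_forest.values())
    hit = sum(users_per_forest[name] for name in failed_forests)
    return hit / total


# Hypothetical layout: 100 equally sized forests of 50,000 users each.
forests = {f"forest{i:02d}": 50_000 for i in range(100)}

# One forest loses its domain controllers; the other 99 carry on.
print(affected_fraction(forests, ["forest07"]))
```

Under these assumptions a single failed forest touches 1 user in 100. Put the same users into one big forest and the same failure touches everyone.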
Another positive is the speed at which the engineers responded to the outage, read the telemetry, understood the problem, and responded with fixes. Sure, we’d all like DevOps to be even faster, but this looks as if the model worked.
It’s obvious that the telemetry and data available for debugging problems is much broader and deeper than it is inside most on-premises deployments. But that’s how it should be as otherwise managing the 175,000-plus mailbox servers inside Exchange Online would be nigh-on impossible.
Not Doom and Gloom
I’m sure that the folks who sell products to help Office 365 tenants cope with cloud failures will seize on these outages to drive home their point that Office 365 is fallible. And they’re right. All cloud services are fallible. Anything can happen from the client workstation to the internet connection to DNS to a failure inside Microsoft.
In fact, failures happen all the time. But in most cases, the segmentation of Office 365 into regions, datacenters, and even Database Availability Groups lessens the potential of a failure to spread. The MFA outage in November is a notable example of where a single point of failure caused problems across multiple regions.
Hope for the Future
Azure Active Directory has had a bad run. Let’s hope that stability is restored and the next few months are quiet. In the interim, DownDetector.com is a good place to check if you think problems are brewing, and if you use Twitter, follow the Microsoft 365 status account to get live updates. And of course, we’ll keep an eye on things here!
For more information about how to cope with Office 365 outages, read Chapter 4 of the Office 365 for IT Pros eBook.
3 Replies to “Azure Active Directory Still a Weakness for Office 365”
Hi Tony,
What’s bad about this situation is that the weakness of Azure Active Directory to handle spikes in load caused when capacity drops continues to be a concern. Given the essential nature of Azure Active Directory to the Office 365 ecosystem, it seems like Microsoft could do more to manage spikes when things go wrong.
You know that this outage has no dependency on Azure Active Directory, because it was a DC that was part of the Exchange forest that caused the incident. Azure Active Directory and the Exchange Online Active Directory forest are only synced with each other, but an Exchange forest DC is not part of the Azure AD service…
While you’re right that the DC was part of the Exchange forest, I view the overall infrastructure as a whole… Hence my view.