Will the New SLA Make Any Difference in Practice?
Message Center notification MC301471 (December 3, 2021) announces that effective December 1, the Service Level Agreement (SLA) covering the Microsoft Teams Calling, Phone System, and Audio Conferencing services improves from 99.9% to 99.99%. Put another way, “Microsoft guarantees end users will be able to initiate a PSTN call, dial into conference audio via the PSTN, or process calls with Call Queues or Auto Attendant at least 99.99% of the time.”
The new SLA is documented in the December 2021 worldwide Service Level Agreement for Microsoft Online Services.
Meeting an SLA of 99.99% means that a service can be unavailable for no more than 4 minutes and 22 seconds in a month. The equivalent measure for the previous 99.9% SLA is 43 minutes 49 seconds, so Microsoft promises a considerable improvement in availability. Guaranteeing uptime is part of Microsoft’s campaign to convince customers to embrace the Teams Phone system. The important point is that Microsoft promises to compensate customers if the service fails to meet the SLA. When this happens, Microsoft calculates the amount of lost time and applies the payment scheme outlined in Table 1. In other words, if Teams Phone only achieves the old 99.9% threshold, Microsoft will compensate customers with a 10% service credit (applied towards future charges for Teams Phone).
Table 1: Microsoft service credits for missed SLA thresholds
|SLA Achieved|Maximum Monthly Downtime (see https://uptime.is/)|Microsoft Service Credit|
|< 99.99%|4 minutes 22 seconds|10%|
|< 99.9%|43 minutes 49 seconds|25%|
|< 99%|7 hours 18 minutes 17 seconds|50%|
|< 95%|1 day 12 hours 31 minutes 27 seconds|100%|
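The downtime figures in Table 1 can be reproduced with a few lines of arithmetic. The sketch below is illustrative only. It assumes an average month of 365.2425 / 12 days (2,629,746 seconds) and truncates to whole seconds. That convention is inferred because it matches the table values, not taken from any documentation, and the function names are made up for this example.

```python
# Assumption: an "average" month of 365.2425 / 12 days = 2,629,746 seconds.
# This convention is inferred because it reproduces the figures in Table 1.
AVERAGE_MONTH_SECONDS = 2_629_746


def allowed_downtime_seconds(sla_basis_points: int) -> int:
    """Whole seconds of downtime permitted per month at the given SLA.

    The SLA is passed in basis points (99.99% -> 9999) so the arithmetic
    stays in integers and avoids floating-point rounding surprises.
    """
    downtime_basis_points = 10_000 - sla_basis_points
    return AVERAGE_MONTH_SECONDS * downtime_basis_points // 10_000


def format_duration(total_seconds: int) -> str:
    """Render seconds as d/h/m/s, dropping leading units that are zero."""
    days, rest = divmod(total_seconds, 86_400)
    hours, rest = divmod(rest, 3_600)
    minutes, seconds = divmod(rest, 60)
    parts = [(days, "d"), (hours, "h"), (minutes, "m"), (seconds, "s")]
    while len(parts) > 1 and parts[0][0] == 0:
        parts.pop(0)
    return " ".join(f"{value}{unit}" for value, unit in parts)


if __name__ == "__main__":
    for basis_points in (9999, 9990, 9900, 9500):
        seconds = allowed_downtime_seconds(basis_points)
        print(f"{basis_points / 100:g}% -> {format_duration(seconds)}")
```

Run as a script, this prints 4m 22s, 43m 49s, 7h 18m 17s, and 1d 12h 31m 27s for the four thresholds. A specific calendar month will differ slightly: a 30-day month allows 4 minutes 19 seconds at 99.99%.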
Incidents, Lost Minutes, and Calculations
An incident means a single event or set of events that results in downtime, that is, a period when a service is unavailable to end users. Microsoft’s documentation describes several limitations, applicable to all services, under which the guarantee of availability does not apply. These limitations range from acts of terrorism to issues resulting from inadequate bandwidth or use of hardware or software not provided by Microsoft.
Assuming that a problem occurs during normal operation, service-specific terms become applicable. For Teams, an incident is when end users cannot:
- Start a PSTN call to a landline or mobile phone.
- Dial into Teams online meetings using PSTN numbers.
- Use the Teams Phone system to process calls with the Call Queues or Auto Attendant features.
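Microsoft calculates the monthly uptime percentage (the figure compared against the SLA threshold) with the formula:
Monthly Uptime Percentage = (User Minutes - Downtime) / User Minutes x 100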
User minutes is the total number of minutes in a month less any scheduled downtime, or 43,200 for a 30-day month. To determine the total number of available user minutes, multiply the value by the total number of licensed users. Let’s assume that a tenant licenses 1,000 users for Teams Phone, so the figure for user minutes is 43,200,000. Because people don’t work 24-hour days, seven days a week, the number of minutes when people might actually use the service is much lower. This is one reason why a service can attain a high SLA.
Let’s assume that the tenant has an outage during which 250 of their users cannot make calls to PSTN numbers for 15 minutes. The number of downtime user minutes is:
15 * 250 = 3,750
That sounds like a lot, and each of those minutes is painful for those who can’t make what might be very important calls, but the service still meets the commitment because the outage reduces the monthly uptime percentage to only 99.991%. The margin is thin, though. If another outage of similar size occurs during the same calendar month, Microsoft won’t meet the SLA and will have to compensate the tenant.
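The worked example can be checked with a short script. This is a minimal sketch, assuming the calculation is user minutes minus downtime, divided by user minutes, times 100, and that the credit tiers are those in Table 1. The function and constant names are illustrative and do not come from any Microsoft tool.

```python
from fractions import Fraction

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200

# Credit tiers from Table 1 as (uptime must be below this %, credit %),
# ordered from the most severe shortfall to the least.
CREDIT_TIERS = [
    (Fraction(95), 100),
    (Fraction(99), 50),
    (Fraction("99.9"), 25),
    (Fraction("99.99"), 10),
]


def monthly_uptime_percentage(licensed_users: int,
                              downtime_user_minutes: int,
                              minutes_in_month: int = MINUTES_PER_30_DAY_MONTH) -> Fraction:
    """Return the uptime percentage as an exact fraction."""
    user_minutes = licensed_users * minutes_in_month
    return Fraction(user_minutes - downtime_user_minutes, user_minutes) * 100


def service_credit_percent(uptime: Fraction) -> int:
    """Look up the service credit for an uptime percentage (0 if the SLA is met)."""
    for threshold, credit in CREDIT_TIERS:
        if uptime < threshold:
            return credit
    return 0


if __name__ == "__main__":
    downtime = 15 * 250  # 15 minutes for 250 affected users
    uptime = monthly_uptime_percentage(1_000, downtime)
    print(f"Downtime user minutes: {downtime:,}")
    print(f"Monthly uptime: {float(uptime):.3f}%")
    print(f"Service credit: {service_credit_percent(uptime)}%")
```

For the example tenant, the script reports 3,750 downtime user minutes, an uptime of 99.991%, and no service credit. Doubling the downtime to 7,500 user minutes drops the figure to roughly 99.983%, which falls into the 10% credit tier.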
To put a practical perspective on the improved SLA, under the previous threshold it would take an outage of more than 43 minutes affecting all users to drop the monthly uptime percentage under 99.9%.
Proving the Claim
Of course, believing that an incident covered by the SLA has happened is one thing. Getting Microsoft to agree is another. The formal text says:
In order for Microsoft to consider a claim, you must submit the claim to customer support at Microsoft Corporation including all information necessary for Microsoft to validate the claim, including but not limited to: (i) a detailed description of the Incident; (ii) information regarding the time and duration of the Downtime; (iii) the number and location(s) of affected users (if applicable); and (iv) descriptions of your attempts to resolve the Incident at the time of occurrence.
In other words, you have zero chance of being able to make a successful claim unless you can prove (to Microsoft’s satisfaction) that a problem happened, how long it lasted, and how many users the issue affected. In practical terms, this means that you should file a support incident as soon as you’re sure that the Teams Phone system has problems, together with evidence of why you think an issue is present. It’s also a good idea to document how many users are affected, as this is an important piece of data for compensation claims.
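One way to keep that evidence organized is to capture each incident in a small record as it happens. The sketch below is purely illustrative: the class, its fields, and the completeness check are hypothetical and simply mirror the four items listed in the claim text. It is not a Microsoft format or API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class IncidentRecord:
    """Hypothetical record of the evidence gathered for an SLA claim."""
    description: str                 # (i) detailed description of the incident
    started_at: datetime             # (ii) time of the downtime
    ended_at: datetime               # (ii) used to derive the duration
    affected_users: int              # (iii) number of affected users
    locations: List[str] = field(default_factory=list)            # (iii) locations
    resolution_attempts: List[str] = field(default_factory=list)  # (iv) what you tried

    @property
    def duration(self) -> timedelta:
        return self.ended_at - self.started_at

    @property
    def downtime_user_minutes(self) -> int:
        """Whole minutes of downtime multiplied by the affected user count."""
        return int(self.duration.total_seconds() // 60) * self.affected_users

    def missing_items(self) -> List[str]:
        """List which of the four claim items still lack information."""
        missing = []
        if not self.description.strip():
            missing.append("description")
        if self.duration <= timedelta(0):
            missing.append("time and duration")
        if self.affected_users <= 0:
            missing.append("affected users")
        if not self.resolution_attempts:
            missing.append("resolution attempts")
        return missing


if __name__ == "__main__":
    record = IncidentRecord(
        description="Outbound PSTN calls fail with a fast busy tone",
        started_at=datetime(2021, 12, 6, 9, 0),
        ended_at=datetime(2021, 12, 6, 9, 15),
        affected_users=250,
        locations=["Dublin"],
        resolution_attempts=["Restarted Teams clients", "Tested from a second network"],
    )
    print(record.downtime_user_minutes, record.missing_items())
```

The sample values are invented for illustration. The point of the completeness check is to flag gaps while the outage is still in progress, when the missing details are easiest to collect.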
Because it can take some time to recognize a problem and to conclude that it’s not a transient issue, the actual length of an outage is often longer than that formally recorded by Microsoft. It’s likely that the gap will be smaller for Teams Phone than for other workloads because making calls either works or it doesn’t, while caching and other software techniques can disguise outages elsewhere in Microsoft 365.
SLA No Good Without Monitoring
Nice as it is for Microsoft to provide an enhanced, financially backed SLA for Teams Phone, it means nothing unless you monitor the service and report issues promptly. There’s no point complaining that Microsoft’s service didn’t come up to standard if you don’t document problems when they occur. If you want to take advantage of the new Teams Phone SLA, make sure you keep an eye on what’s happening with calls.