Exchange Online: Mailboxes at Scale
Exchange Online uses Native Data Protection to avoid the need for backups. Given that the service spans more than 175,000 mailbox servers storing 1.1 exabytes of data, Microsoft’s desire to avoid backups is understandable. At this scale, managing backups and responding to the inevitable restore requests would be a huge undertaking.
Enter Native Data Protection
Along with features like Single Item Recovery (SIR – enabled for all Exchange Online mailboxes), the Database Availability Group (DAG) is a pillar of Native Data Protection. Every Exchange Online mailbox is in a database with four copies spread across multiple datacenters. One of the database copies is lagged. Microsoft says: “The lagged database copy is not intended for individual mailbox recovery or mailbox item recovery. Its purpose is to provide a recovery mechanism for the rare event of system-wide, catastrophic logical corruption.” In other words, don’t ask Microsoft to recover mailbox items from the lagged database copy. Other methods exist to make sure that items deleted in error can be restored, like the Recoverable Items folder or retention policies.
DAGs depend on log shipping between mailbox servers to keep database copies synchronized. Microsoft introduced the DAG in Exchange 2010 and has improved its resilience and dependability over the three major releases shipped since.
Backup FUD in Office 365
Ten years of experience with DAGs hasn’t stopped some FUD being spread to justify the need for database backups. Take this statement from a white paper distributed by a backup vendor:
“…data is replicated in near real-time between datacenters to ensure very high availability. No snapshots or backups are performed. The drawback of this approach is that any corruption is also replicated with no rollback possible. For legal discovery, this means that messages can be lost…”
The white paper covers Office 365 and does not specify whether the statement applies to SQL (for SharePoint Online and OneDrive for Business), the Azure data services used for Teams and Planner, or Exchange Online (Office 365 stores messages in all these repositories). But that’s the nice thing about FUD: throw out something that’s non-specific and hope that the dirt lands somewhere interesting.
History of Exchange Corruption
If this were 2002, the assertion that corruption can lead to message loss in Exchange might be true. Those of us who remember the joys of -1018 errors in the Exchange database and the need to run the ESEUTIL utility afterwards to rebuild the database can certainly attest to the woe that logical or physical corruption can wreak on Exchange.
Using Database Copies to Fix Corruption
But the situation has improved enormously since the introduction of the DAG, and the Exchange mailbox servers running in the cloud don’t replicate corruption in a way that causes data loss. Bad database pages do occur, and in a single-copy database they can lead to data loss. To avoid this issue, the DAG includes a page patching mechanism to detect and recover from corruption. Here’s some text from my Microsoft Exchange Server 2013 Inside Out: Mailbox and High Availability book:
If the Store detects a problem page in the active database, it places a marker in the log stream (in the current transaction log) that acts as a request for a valid copy of the corrupted page. The request is sent to all database copies, where it is inspected and processed along with other log content. When the Information Store replays data for the passive copy, it notices the marker and responds to the request by invoking a replication service callback to ship a copy of the page to the server that hosts the active database. When this server receives the replicated page, the Store patches it back into the active database to remove the corruption. Other servers that host passive copies might also respond with pages, but these are ignored after the active database has been restored to good health.
The process to fix a corrupted page in a passive database copy is slightly different. In this case, the server that hosts the passive copy immediately pauses log replay. Log copying continues to ensure that all the transaction logs that will eventually be required to bring the database completely up to date are available on the server. The server then requests a copy of the corrupted page from the server that hosts the active database, using the internal ESE seeding mechanism. The active server responds with the page data. The passive server then waits until all the log files necessary to bring it up to date past the point at which the active server provided the page (as indicated by the maximum required generation) have been copied and inspected. When it is sure that all the required data is available, the passive server then restores the corrupt page and resumes log replay to clear the backlog of transaction logs that have accumulated since the corruption was first detected.
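The two page-patching flows described in the excerpt can be modeled in a few lines of Python. This is only a toy sketch to make the sequence of steps concrete, not how the Information Store is implemented: the `Page` and `DatabaseCopy` classes, the `patch_active` and `patch_passive` functions, and the use of CRC32 as the page checksum are all assumptions made for illustration.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Page:
    """A database page plus the checksum recorded when it was written (hypothetical)."""
    data: bytes
    crc: int

    @classmethod
    def new(cls, data: bytes) -> "Page":
        return cls(data, zlib.crc32(data))

    def is_valid(self) -> bool:
        # A page is considered corrupt when its content no longer matches its checksum.
        return zlib.crc32(self.data) == self.crc


@dataclass
class DatabaseCopy:
    """One copy of a mailbox database in the toy model."""
    name: str
    pages: dict = field(default_factory=dict)  # page number -> Page
    copied_generation: int = 0                 # highest log generation copied to this server
    replay_paused: bool = False


def patch_active(active, passives, page_no):
    """Active-copy flow: the request reaches every passive copy through the log
    stream; the first healthy page that comes back is patched in and any later
    replies are ignored. Returns the name of the donor copy, or None."""
    for passive in passives:
        candidate = passive.pages.get(page_no)
        if candidate is not None and candidate.is_valid():
            active.pages[page_no] = Page(candidate.data, candidate.crc)
            return passive.name
    return None


def patch_passive(passive, active, page_no, max_required_generation):
    """Passive-copy flow: pause replay, fetch the page from the active copy, and
    only patch once every log up to the maximum required generation has been
    copied. Returns True when the page is patched and replay resumes."""
    passive.replay_paused = True  # log copying continues; only replay stops
    good = active.pages[page_no]
    if not good.is_valid():
        return False  # no healthy source page; replay stays paused
    if passive.copied_generation < max_required_generation:
        return False  # still waiting for logs; replay stays paused
    passive.pages[page_no] = Page(good.data, good.crc)
    passive.replay_paused = False
    return True
```

Calling `patch_passive` a second time after `copied_generation` has caught up stands in for the passive server waiting until the required logs arrive. The point the model makes is the same one the excerpt makes: a healthy page always comes from another copy, so corruption in one copy is repaired rather than replicated.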
In addition to background database scanning, page patching is used by other resiliency features baked into the Information Store, like lost flush detection. Obviously, a feature like this works best when multiple database copies exist to service the request for good pages. That’s why Exchange Online runs with four database copies, one of which is lagged (7 days behind the active copy).
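To illustrate what "lagged" means in practice, here is a minimal sketch of the replay decision for a lagged copy: logs are copied to the server as usual, but a log is only replayed once it is older than the lag. The function name `logs_ready_for_replay` and the dictionary of log creation times are hypothetical; only the 7-day figure comes from the text above.

```python
from datetime import datetime, timedelta

# The 7-day lag quoted for the Exchange Online lagged copy.
REPLAY_LAG = timedelta(days=7)


def logs_ready_for_replay(copied_logs, now, lag=REPLAY_LAG):
    """Return, in generation order, the log generations old enough to replay.

    copied_logs maps a log generation number to the time the log was created
    (a hypothetical structure for this sketch). Newer logs stay on disk,
    unreplayed, which is what keeps the copy behind the active database.
    """
    return [
        generation
        for generation, created in sorted(copied_logs.items())
        if now - created >= lag
    ]
```

Because the newer logs are held back rather than discarded, the lagged copy can still be rolled forward to any point within the lag window if a system-wide logical corruption ever needs to be undone.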
Other features that contribute to avoiding physical or logical corruption in Exchange Online mailbox databases include single-bit correction and consistency checking of transaction logs before they are replayed into passive database copies. Exchange Online also uses the ReFS file system for its databases to reduce the chance of storage corruption. In short, there’s a lot of technology deployed to minimize the chance of physical or logical corruption creeping into a mailbox database.
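The idea of checking transaction logs before replay can be sketched as follows. This is an illustrative model only: the `LogFile` class, the `make_log` and `inspect_logs` helpers, and the CRC32 checksum are assumptions, and the real inspection performed by Exchange is more involved. The sketch assumes the logs passed in are all newer than the last replayed generation.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LogFile:
    """A copied transaction log with the checksum recorded at the source (hypothetical)."""
    generation: int
    payload: bytes
    crc: int


def make_log(generation, payload):
    # Helper that stamps a log with the checksum of its payload.
    return LogFile(generation, payload, zlib.crc32(payload))


def inspect_logs(logs, last_replayed):
    """Return the generations that are safe to replay into a passive copy.

    Replay must be strictly sequential, so inspection stops at the first log
    that is out of sequence or whose payload fails its checksum. Everything
    after that point waits until a good copy of the failing log arrives.
    """
    safe = []
    expected = last_replayed + 1
    for log in sorted(logs, key=lambda item: item.generation):
        if log.generation != expected or zlib.crc32(log.payload) != log.crc:
            break
        safe.append(log.generation)
        expected += 1
    return safe
```

Stopping at the first bad log is the design choice that matters here: a damaged log never reaches the passive database, so the damage cannot spread from the log stream into a database copy.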
Not Against Backups
Exchange Online uses servers based on Exchange 2019, two full versions past Exchange 2013. Information about how Exchange deals with page corruption within a DAG has been available for a long time. You’d think that people who publish white papers about Office 365 and Exchange Online would take the time to understand how the technology works before concluding that corruption will cause data loss.
I’m not against backups of Exchange Online data, but only if your organization really needs them. Regulations might mandate such a need, but feelings that corruption will happen don’t. And basing an assessment on vendor-provided tommyrot is never a good plan. If you are interested in Office 365 backups, make sure that you understand the technology, the limitations (including how to restore data into Office 365 in a usable manner), and the cost. And then make your call.
Exchange Native Data Protection is covered in Chapter 4 of the Office 365 for IT Pros eBook.