New Information Surfaces All the Time
It’s amazing what is available online. When refreshing Chapter 20 of the Office 365 for IT Pros eBook, I found an interesting Microsoft article from 2018 about how to investigate partially indexed Exchange Online items and what this means for content searches.
Why Some Exchange Online Items Aren’t Fully Indexed
Microsoft 365 indexes items as they are added to workloads like Exchange Online, SharePoint Online, and OneDrive for Business, along with the compliance records generated for Teams and Yammer. The indexes are the foundation for content searches and eDiscovery cases.
Microsoft 365 cannot fully index a small percentage of the items in a tenant, for reasons such as an unsupported format, excessive size, or an error during indexing. For instance, indexing can only process Excel attachments if they are under 4 MB (this page says that the first 4 MB is indexed but the remainder is not), while indexable attachments for other file types can be up to 150 MB. Some files, like graphic and audio files, have metadata (document or email properties such as subject, creation date, and author) but no indexable content. In the fun fact category, the maximum body size of an item in the index is 67 million characters (including up to 250 attachments).
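The limits above can be expressed as a small rule of thumb. The following Python sketch is illustrative only: the function name, the extension lists, and the treatment of 1 MB as 1024 × 1024 bytes are assumptions of mine, not something Microsoft documents in this form, and the real indexer considers many more conditions.

```python
# Limits quoted in the text; treating "MB" as 1024 * 1024 bytes is an assumption.
MB = 1024 * 1024
EXCEL_LIMIT = 4 * MB
DEFAULT_LIMIT = 150 * MB

# Hypothetical extension lists, for illustration only.
EXCEL_EXTENSIONS = {"xls", "xlsx", "xlsm"}
NO_CONTENT_EXTENSIONS = {"png", "bmp", "jpg", "mp3", "wav"}


def expected_indexing(extension: str, size_bytes: int) -> str:
    """Guess whether an attachment is indexed in full, in part, or by metadata only."""
    ext = extension.lower().lstrip(".")
    if ext in NO_CONTENT_EXTENSIONS:
        # Graphic and audio files carry metadata but no indexable content.
        return "metadata-only"
    limit = EXCEL_LIMIT if ext in EXCEL_EXTENSIONS else DEFAULT_LIMIT
    # Behavior exactly at the limit is a guess; the text only says "under" and "up to".
    return "full" if size_bytes <= limit else "partial"


print(expected_indexing("xlsx", 5 * MB))
print(expected_indexing("docx", 5 * MB))
print(expected_indexing(".PNG", 200_000))
```

A 5 MB workbook exceeds the Excel limit, while a Word document of the same size is comfortably within the general limit.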
Items that the indexer cannot fully process are referred to as partially indexed (their metadata is indexed, but their content is only partially indexed or not indexed at all). Content search results report these as unindexed items. For example, the search in Figure 1 reports 3,928 unindexed items out of the 1,462,570 items it returned, or around 0.27%.
Even if their content can’t be indexed, content searches do return partially indexed items if matches occur against their metadata. You can export the partially indexed items found by a content search to perform further analysis to determine if they are of interest to an investigation. Sometimes human beings can make more sense of unindexed items than computers can.
Analyzing Partially Indexed Items with PowerShell
The article about investigating the number and type of partially indexed Exchange Online items in a tenant includes a PowerShell script (PartiallyIndexedItems.ps1), which performs a content search for all items in all Exchange Online mailboxes (including user, group, and shared mailboxes). The script outputs summary data about the number and size of mailbox items and the ratio of partially indexed items. In my tenant, the script reported that 0.27% of items are partially indexed (the same figure reported in Figure 1). Measured by size, the ratio is higher at 2.51%, which means that partially indexed items are larger than average and implies that most of them are attachments.
===== Partially indexed items =====
            Total        Ratio
Count       1,462,570    0.27%
Size(GB)    95.35        2.51%
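To see why the gap between the two ratios points to larger-than-average items, here is a quick back-of-the-envelope calculation in Python using the figures reported above. It assumes the 2.51% ratio applies to the 95.35 GB total and that 1 GB is 1024 × 1024 KB; neither assumption is confirmed by the script's documentation.

```python
# Figures taken from the script output and the content search statistics.
total_items = 1_462_570
unindexed_items = 3_928
total_size_gb = 95.35
size_ratio = 0.0251  # 2.51% of total size, as reported by the script

count_ratio = unindexed_items / total_items
unindexed_size_gb = total_size_gb * size_ratio

# Average item sizes in KB (assuming 1 GB = 1024 * 1024 KB).
kb_per_gb = 1024 * 1024
avg_all_kb = total_size_gb * kb_per_gb / total_items
avg_unindexed_kb = unindexed_size_gb * kb_per_gb / unindexed_items

print(f"Count ratio:              {count_ratio:.2%}")
print(f"Partially indexed size:   {unindexed_size_gb:.2f} GB")
print(f"Average item size:        {avg_all_kb:.0f} KB")
print(f"Average unindexed size:   {avg_unindexed_kb:.0f} KB")
print(f"Size multiple:            {avg_unindexed_kb / avg_all_kb:.1f}x")
```

On these numbers, the average partially indexed item is roughly nine times the size of the average mailbox item, which is consistent with attachments being the main culprit.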
The next step is to report the reasons why items are partially indexed and the file type of those items. The script generates output like this:
===== Reasons for partially indexed items =====
attachmentrms => 25
parserencrypted
    encoffmetro => 24
    pdf => 5
    xls => 4
    zip => 2
parsererror
    doc => 7
    docm => 1
    docx => 32
    encoffmetro => 8
    gzip => 37
    json => 1
    pdf => 168
    png => 2
    pptx => 11
    xml => 5
    zip => 1
retrieverrms => 1797
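The per-file-type counts are easier to compare once they are rolled up by reason. The Python sketch below copies the numbers from the output above into a dictionary (the structure and variable names are my own) and totals each error tag.

```python
# Counts copied from the script output; the nesting mirrors reason -> file type.
reasons = {
    "attachmentrms": 25,
    "parserencrypted": {"encoffmetro": 24, "pdf": 5, "xls": 4, "zip": 2},
    "parsererror": {
        "doc": 7, "docm": 1, "docx": 32, "encoffmetro": 8, "gzip": 37,
        "json": 1, "pdf": 168, "png": 2, "pptx": 11, "xml": 5, "zip": 1,
    },
    "retrieverrms": 1797,
}


def total(entry):
    """Return the count for a reason, summing file types where present."""
    return sum(entry.values()) if isinstance(entry, dict) else entry


totals = {reason: total(entry) for reason, entry in reasons.items()}

# List the reasons from most to least common.
for reason, count in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{reason:<16} {count:>5}")
```

The rollup makes it obvious that rights management (retrieverrms) dwarfs every parser-related reason in this sample.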
Generating More Readable Output
I amended the script to produce more readable data that can also be exported to a CSV file (see below). The error text used here is my explanation of what caused partial indexing. As you can see, most of the issues are caused by messages protected by Office 365 Message Encryption (OME) or sensitivity labels, graphic files, and zipped files.
FileType                            Count  ErrorText
--------                            -----  ---------
                                     1797  Rights Management Encrypted Item
PNG graphic file                      966  Parser encountered unsupported format
Unknown/no format                     529  Parser encountered unknown format
GZIP file                             320  Parser encountered unsupported format
PDF file                              168  Parser encountered an error
Password protected PowerPoint PPTX     90  Parser encountered unsupported format
GZIP file                              37  Parser encountered an error
Bitmap graphic file                    34  Parser encountered unsupported format
Word (DOCX) document                   32  Parser encountered an error
                                       25  Rights Management Encrypted Attachment
Password protected PowerPoint PPTX     24  Parser couldn't decrypt item
PDF file                               20  Parser encountered malformed item
ZIP file                               17  Parser error on output size
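The idea behind the readable output is a simple lookup from error tag and file extension to friendlier text. The modified script is PowerShell; the Python sketch below only illustrates the approach. The mappings are inferred by matching counts between the raw and readable outputs (for example, retrieverrms with 1797 items), so treat them as my assumptions rather than the script's actual tables.

```python
import csv
import io

# Mappings inferred by matching counts across the two outputs (assumptions).
ERROR_TEXT = {
    "retrieverrms": "Rights Management Encrypted Item",
    "attachmentrms": "Rights Management Encrypted Attachment",
    "parserencrypted": "Parser couldn't decrypt item",
    "parsererror": "Parser encountered an error",
}
FILE_TYPE = {
    "pdf": "PDF file",
    "docx": "Word (DOCX) document",
    "gzip": "GZIP file",
}

# A subset of (error tag, extension, count) tuples from the raw output.
raw = [
    ("parsererror", "docx", 32),
    ("retrieverrms", "", 1797),
    ("parsererror", "pdf", 168),
    ("attachmentrms", "", 25),
    ("parsererror", "gzip", 37),
]


def to_rows(entries):
    """Translate raw tags into readable rows, sorted by descending count."""
    rows = [
        {
            "FileType": FILE_TYPE.get(ext, ext.upper()),
            "Count": count,
            "ErrorText": ERROR_TEXT.get(tag, tag),
        }
        for tag, ext, count in entries
    ]
    return sorted(rows, key=lambda row: row["Count"], reverse=True)


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["FileType", "Count", "ErrorText"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = to_rows(raw)
csv_text = to_csv(rows)
print(csv_text)
```

Unknown tags or extensions fall through unchanged (extensions are uppercased), so new values still appear in the report instead of being dropped.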
You can download a copy of the modified script from GitHub.
SharePoint Online and OneDrive for Business also have partially indexed items but seem to be less prone to the problem than Exchange Online is. This is probably due to the higher volume of individual items flowing through email and the wide spectrum of attachments accompanying messages.
Understanding the data that exists inside an Office 365 tenant is obviously a good thing to do. The script offers some insight into the complexities of indexing high volumes of items of different types and might explain why searches don’t always return what you might expect.