Understanding Partially Indexed Exchange Online Messages and Attachments

New Information Surfaces All the Time

It’s amazing what is available online. When refreshing Chapter 20 of the Office 365 for IT Pros eBook, I found an interesting Microsoft article from 2018 about how to investigate partially indexed Exchange Online items and what this means for content searches.

Why Some Exchange Online Items Aren’t Fully Indexed

Microsoft 365 indexes items as they are added to workloads like Exchange Online, SharePoint Online, and OneDrive for Business, along with the compliance records generated for Teams and Yammer. The indexes are the foundation for content searches and eDiscovery cases.

Microsoft 365 cannot fully index a small percentage of items in a tenant, for reasons such as an unsupported format, excessive size, or an error during indexing. For instance, indexing can only process Excel attachments if they are under 4 MB (this page says that the first 4 MB is indexed but the remainder is not), while indexable attachments of other file types can be up to 150 MB. Some files, like graphic and audio files, have metadata (document or email properties such as subject, creation date, and author) but no indexable content. In the fun fact category, the maximum body size of an item in the index is 67 million characters (including up to 250 attachments).
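As a rough rule of thumb, the limits quoted above can be written down as a small check. This is only a sketch: the function name, the sets of file extensions, and the idea of deciding on extension and size alone are assumptions made for illustration. The real indexer applies many more rules than these.

```python
# Limits quoted in the text above; treat them as illustrative, not authoritative.
EXCEL_LIMIT_MB = 4      # Excel attachments must be under this size
OTHER_LIMIT_MB = 150    # other indexable file types can be up to this size

# Hypothetical samples of extensions, chosen only for this example.
EXCEL_TYPES = {"xls", "xlsx", "xlsm"}
METADATA_ONLY_TYPES = {"png", "bmp", "jpg", "mp3", "wav"}


def likely_fully_indexed(extension: str, size_mb: float) -> bool:
    """Guess whether an attachment's content is likely to be fully indexed."""
    ext = extension.lower().lstrip(".")
    if ext in METADATA_ONLY_TYPES:
        # Graphic and audio files have metadata but no indexable content.
        return False
    if ext in EXCEL_TYPES:
        return size_mb < EXCEL_LIMIT_MB
    return size_mb <= OTHER_LIMIT_MB


for name, size in [("xlsx", 3.5), ("xlsx", 10), ("pdf", 100), (".PNG", 0.1)]:
    print(name, size, likely_fully_indexed(name, size))
```

A `False` result here only means that the item is a candidate for partial indexing under the quoted limits. Items can also fail for other reasons, such as encryption or parser errors.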

Items which the indexer cannot fully process are referred to as partially indexed (their metadata is indexed, but their content is partially or not indexed). Content search results report these as unindexed items. For example, the search in Figure 1 shows that 3,928 unindexed items were found in a search returning 1,462,570 items, or around 0.27%.

Figure 1: Viewing the results of an Office 365 content search

Even if their content can’t be indexed, content searches do return partially indexed items if matches occur against their metadata. You can export the partially indexed items found by a content search to perform further analysis to determine if they are of interest to an investigation. Sometimes human beings can make more sense of unindexed items than computers can.

Analyzing Partially Indexed Items with PowerShell

The article about investigating the number and type of partially indexed Exchange Online items in a tenant includes a PowerShell script (PartiallyIndexedItems.ps1), which performs a content search for all items in all Exchange Online mailboxes (including user, group, and shared mailboxes). The script outputs summary data about the number and size of mailbox items and the ratio of partially indexed items. As the output below shows, the script reported that 0.27% of items in my tenant are partially indexed (the same figure as in Figure 1). In terms of size, the ratio is higher at 2.51%, which implies that most of the partially indexed items are attachments.

===== Partially indexed items =====
         Total          Ratio
Count    1,462,570      0.27%
Size(GB) 95.35          2.51%
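The arithmetic behind these ratios can be checked with a few lines of Python. This is only a sketch: the counts come from Figure 1 and the script output above, the variable names are made up for illustration, and the GB-to-KB conversion assumes binary units (the comparison between the two averages does not depend on that choice).

```python
# Figures taken from the search results and script output above.
total_items = 1_462_570
partial_items = 3_928
total_size_gb = 95.35
size_ratio = 0.0251  # share of total size held by partially indexed items

# Ratio of partially indexed items by count.
count_ratio = partial_items / total_items

# Estimated size of the partially indexed items.
partial_size_gb = total_size_gb * size_ratio

# Average item sizes in KB (assumes 1 GB = 1024 * 1024 KB).
kb_per_gb = 1024 * 1024
avg_all_kb = total_size_gb * kb_per_gb / total_items
avg_partial_kb = partial_size_gb * kb_per_gb / partial_items

print(f"Partially indexed by count: {count_ratio:.2%}")
print(f"Average item size: {avg_all_kb:.0f} KB")
print(f"Average partially indexed item size: {avg_partial_kb:.0f} KB")
print(f"Size multiple: {avg_partial_kb / avg_all_kb:.1f}x")
```

On these figures, the average partially indexed item is roughly nine times the size of the average mailbox item, which is consistent with attachments being responsible for most of the problem.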

The next step is to report the reasons why items are partially indexed and the file type of those items. The script generates output like this:

===== Reasons for partially indexed items =====
attachmentrms
     => 25
parserencrypted
    encoffmetro => 24
    pdf => 5
    xls => 4
    zip => 2
parsererror
    doc => 7
    docm => 1
    docx => 32
    encoffmetro => 8
    gzip => 37
    json => 1
    pdf => 168
    png => 2
    pptx => 11
    xml => 5
    zip => 1
retrieverrms
     => 1797
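Output in this shape is easy to turn into rows for further analysis. The following Python sketch is not part of the Microsoft script. It assumes the layout shown above: a reason tag on its own line, followed by indented `filetype => count` lines, where the file type can be empty. The `parse_reasons` name and the shortened sample text are made up for illustration.

```python
def parse_reasons(report: str) -> list[tuple[str, str, int]]:
    """Turn the 'reasons' report into (reason, file_type, count) rows."""
    rows = []
    reason = ""
    for line in report.splitlines():
        if not line.strip() or line.startswith("====="):
            continue  # skip blank lines and the banner
        if "=>" in line:
            # Detail line: file type (possibly empty) and a count.
            file_type, count = line.split("=>")
            rows.append((reason, file_type.strip(), int(count)))
        else:
            # Any other line starts a new reason group.
            reason = line.strip()
    return rows


# Shortened excerpt of the output shown above.
sample = """===== Reasons for partially indexed items =====
attachmentrms
     => 25
parserencrypted
    encoffmetro => 24
    pdf => 5
retrieverrms
     => 1797
"""

rows = parse_reasons(sample)
for row in rows:
    print(row)
```

Keeping the reason tag on every row makes it straightforward to sort by count or group by file type later.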

Generating More Readable Output

I amended the script to produce more readable data that can also be exported to a CSV file (see below). The error text used here is my explanation of what caused partial indexing. As you can see, most of the issues are caused by messages protected by Office 365 Message Encryption (OME) or sensitivity labels, graphic files, and zipped files.

FileType                           Count ErrorText
--------                           ----- ---------
                                    1797 Rights Management Encrypted Item
PNG graphic file                     966 Parser encountered unsupported format
Unknown/no format                    529 Parser encountered unknown format
GZIP file                            320 Parser encountered unsupported format
PDF file                             168 Parser encountered an error
Password protected PowerPoint PPTX    90 Parser encountered unsupported format
GZIP file                             37 Parser encountered an error
Bitmap graphic file                   34 Parser encountered unsupported format
Word (DOCX) document                  32 Parser encountered an error
                                      25 Rights Management Encrypted Attachment
Password protected PowerPoint PPTX    24 Parser couldn't decrypt item
PDF file                              20 Parser encountered malformed item
ZIP file                              17 Parser error on output size
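The same idea can be sketched in Python: translate the raw tags into friendlier text, sort by count, and write CSV. This is an illustration rather than the modified PowerShell script. The lookup tables below are inferred by matching counts between the raw output and the table above (for example, retrieverrms with 1797 items), they cover only a few tags, and the `to_csv` name is hypothetical.

```python
import csv
import io

# Mappings inferred from matching counts in the two outputs; incomplete by design.
ERROR_TEXT = {
    "retrieverrms": "Rights Management Encrypted Item",
    "attachmentrms": "Rights Management Encrypted Attachment",
    "parserencrypted": "Parser couldn't decrypt item",
    "parsererror": "Parser encountered an error",
}
FILE_TYPE = {
    "pdf": "PDF file",
    "gzip": "GZIP file",
    "docx": "Word (DOCX) document",
    "encoffmetro": "Password protected PowerPoint PPTX",
}


def to_csv(rows):
    """Render (reason, file_type, count) rows as CSV text, largest count first."""
    ordered = sorted(rows, key=lambda row: row[2], reverse=True)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["FileType", "Count", "ErrorText"])
    for reason, file_type, count in ordered:
        # Unknown tags fall back to the raw value instead of being dropped.
        writer.writerow([
            FILE_TYPE.get(file_type, file_type),
            count,
            ERROR_TEXT.get(reason, reason),
        ])
    return buffer.getvalue()


sample_rows = [
    ("parsererror", "pdf", 168),
    ("retrieverrms", "", 1797),
    ("attachmentrms", "", 25),
]
print(to_csv(sample_rows))
```

Falling back to the raw tag for anything missing from the lookup tables means that new reasons or file types still appear in the report.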

You can download a copy of the modified script from GitHub.

SharePoint Online and OneDrive for Business also have partially indexed items but seem to be less prone to the problem than Exchange Online is. This is probably due to the higher volume of individual items flowing through email and the wide spectrum of attachments accompanying messages.

Understanding the data which exists inside an Office 365 tenant is obviously a good thing to do. The script offers some insight into the complexities of indexing high volumes of items of different types and might explain why searches don’t always return what you might expect.
