Turbocharging the Analysis of Office 365 Data with PowerShell Hash Tables

The Joy of Scripting PowerShell

For the last few months, I have been dabbling with a PowerShell script to extract and report usage data for multiple Office 365 workloads from the Microsoft Graph. The idea is that by looking at activity across all the workloads reported in the Graph, you can figure out if accounts are in use and if so, what they are used for. If an account isn’t in use, then you might remove it and save some licenses.

One of the joys of PowerShell is how quickly you can put a solution together. The corollary is sometimes that the solution isn’t as efficient as it could be, which often happens when you’re not a professional programmer. When I write a script, the most important thing is often to illustrate a principle and show how something works. When PowerShell scripts are deployed into production, they’re usually upgraded and improved by programmers to meet organizational standards and fit in with other scripts used to manage the infrastructure. For this reason, I don’t bother too much with tweaking for performance.

This script is different. It’s been picked up by several tenants who reported that the script works but it’s slow when asked to process data for thousands of accounts. This deserved some investigation which produced some improvements, such as using PowerShell’s Where method to filter data.

PowerShell Hash Tables

But PowerShell is not a database and storing data about account usage in PowerShell list objects only scales so far. There are many web articles covering PowerShell performance with large amounts of data, many of which point to using hash tables because they are very efficient for finding and retrieving data (see this article about how to use hash tables).

A hash table is a collection of key/value pairs. The keys are unique, and the values are often some information associated with the key. For instance, because Office 365 objects like groups and sites store sensitivity labels as GUIDs, I often create a hash table composed of the GUID (key) and label display name (value) which I can then use to interpret the GUIDs stored in objects. Here’s what the code looks like:

Anytime I need to find the display name of a label, I can do something like this:

Apart from their usefulness in situations like described above, hash tables are very fast when you use keyed access. Speed being of the essence when thousands of records are to be processed, I decided to investigate if hash tables could replace the list objects used by the script.

Keys and Values

Finding a key is no problem because the user principal name is unique for each account. Figuring out how to store all the data in the hash table value was another matter. That is, until I noticed that: ”the keys and values in a hash table can have any .NET object type…” In other words, you’re not limited to storing simple values in a hash table.

When the script extracts usage data for a workload (like Teams or Exchange) from the Graph, it processes each record to create a list of accounts and their usage data for that workload. After some experimentation, I was able to populate the hash table by:

  • Creating an array of the usage data for the workload for an account.
  • Appending the array to the existing usage data extracted from other workloads for the account (as stored in the hash table).
  • Writing the updated array back into the hash table.

This might be inelegant, but it works. After all workloads are processed, the result is a hash table keyed on the user principal name with a value composed of an array containing the usage data for all workloads for that user. Access to the data is via the user principal name. For example:

The display is truncated here to show two of the six workload usage data extracted for an account.

Creating the report is then a matter of processing each account to extract the information and format the data. To do string comparisons and other calculations, I found that it was necessary to use the Out-String cmdlet to make the properties taken from the array into trimmed strings. It might be something to do with the way that the hash table values are stitched together from multiple arrays.

Faster Performance

After changing to hash tables, I observed a 70% performance gain in script execution time in my (small) tenant. I expect a much better gain in larger tenants where the advantages of hash table access become more pronounced. This feeling was realized in a test against 20K accounts which proved that the script is now capable of processing at circa 1,000 accounts per minute (Figure 1).

A thousand accounts a minute
Figure 1: A thousand accounts a minute

Update September 18: I received a note saying that the script processed 26,808 accounts at the rate of 3184.71 per minute!

The time required to fetch data from the Graph is the same as previous versions as is the time to prepare data for processing. All the improvement is in the report generation, which is where the hash tables excel. The tenant who processed the script against 20,000 accounts used the report (example shown in Figure 2) to identify 70 accounts assigned Office 365 E5 licenses that can now be reallocated or released (a potential saving of $29,400 annually).

Reviewing account usage to locate underused Office 365 licenses
Figure 2: Reviewing account usage to locate underused Office 365 licenses

For those who’d like to test the script and compare results in your tenant, you can download the original script and the new version from the Office365ITPros repository in GitHub. If you have a good suggestion for improving the performance further, please let comment on GitHub.


OK, we should be writing text for the Office 365 for IT Pros eBook instead of trying to work out how to speed up PowerShell scripts. But you learn a lot about an infrastructure when you program against it, so we’ll keep on scripting…

One Reply to “Turbocharging the Analysis of Office 365 Data with PowerShell Hash Tables”

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.