MS Purview OCR

Tatiana Slepukhin-Zamachnaia
Nov 30, 2024
Updated: Dec 1, 2024

DLP Limitations with Non-Text Data

So, your organization has set up Data Loss Prevention (DLP) Policies to protect sensitive information. Awesome!

But here’s the problem: what if someone takes a screenshot of sensitive information and then exfiltrates it? Another issue involves PDF files.

Now, there are two kinds of PDF files—or rather, two types of information they can contain.

First, there are text-based PDFs, where the text is digitally encoded as characters, making it selectable and searchable. Then, there are image-based PDFs, where text is stored as graphical representations, essentially images of text.

You can test if a PDF is text-based by selecting the text with your cursor. If you can highlight it, it’s text-based. Another way to confirm is by copying and pasting the text into a text editor.

Your DLP Policies can handle text-based PDFs but won’t work on image-based PDFs without OCR enabled.

Even with text-based PDFs, if the file uses proprietary text encoding, it could make the text less accessible to DLP tools.

Insiders can be clever. If they’re bad actors, they’ll probably know these limitations and won’t hesitate to use images to bypass information protection.

OCR to the rescue

Here’s where OCR (Optical Character Recognition) comes to the rescue!

By enabling OCR in Microsoft Purview, your Data Loss Prevention (DLP) Policies gain the ability to detect sensitive information within image-based PDFs and images.

With OCR enabled, Microsoft Purview can extract and analyze text embedded in images, making it much harder for bad actors to slip through unnoticed.

OCR is supported for the following M365 workloads:

· Exchange

· Teams

· SharePoint

· OneDrive for Business

· Windows Devices

Currently, the following image file types are supported, in addition to the image-based PDFs discussed above: JPEG, JPG, BMP, and PNG.

Keep in mind, though, that OCR cannot read handwritten text—it can only recognize machine-typed text or printed text in images.

Configure OCR

Go to the Microsoft Purview Portal, select Settings, and then select Optical Character Recognition (OCR).

You can see that the option to enable OCR is greyed out in my tenant. This is because billing is not set up, as I’m using a free Developer Tenant for this demo. Normally, an organization would have billing set up, and you could enable it here.

OCR Estimates

Microsoft provides a free estimate tool, which is very handy, particularly if you have a lot of images.

You can try it for free by clicking this button.

When the estimates are available, you will see the "View estimations" button:

Click on it to go to the Estimates dashboard:

I have 220 images in my tenant, so the estimated charge is $0.22:

Important note about graphic PDF files: at present, estimates for these files are not supported in SharePoint and OneDrive. Additionally, keep in mind that each page within a PDF file is counted as one distinct image. So, if you have one graphic-based PDF file that contains 90 pages, you will be charged for 90 images.

Microsoft Purview’s OCR charges are based on the number of unique images scanned. Once scanned, the results are reused, regardless of how many policies, users, or activities involve the image, ensuring no duplicate charges.
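
If you want to sanity-check the dashboard numbers yourself, the billing rules above can be sketched in a few lines of PowerShell. This is only a rough sketch: the function name Get-OcrCostEstimate and its parameters are hypothetical, and the per-image rate is an assumption inferred from my demo figures (220 images for $0.22), so check the current pricing for your own tenant.

```powershell
# Hypothetical helper, not a Purview cmdlet.
function Get-OcrCostEstimate {
    param(
        # Number of unique standalone images (each unique image is billed once).
        [int]$UniqueImageCount = 0,
        # Page count of each image-based PDF; every page counts as one image.
        [int[]]$PdfPageCounts = @(),
        # Assumption: rate inferred from the demo tenant (220 images -> 0.22 dollars).
        [decimal]$RatePerImage = 0.001
    )

    $pdfPages = 0
    foreach ($pages in $PdfPageCounts) { $pdfPages += $pages }

    $billable = $UniqueImageCount + $pdfPages

    [pscustomobject]@{
        BillableImages = $billable
        EstimatedCost  = [math]::Round($billable * $RatePerImage, 2)
    }
}

# The demo tenant: 220 images.
Get-OcrCostEstimate -UniqueImageCount 220

# The same tenant plus one 90-page image-based PDF (310 billable images).
Get-OcrCostEstimate -UniqueImageCount 220 -PdfPageCounts 90
```

Because scan results are reused, count each unique image only once, no matter how many policies or users touch it.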

Once you start the estimation process, estimates will be calculated daily until you explicitly stop it. The caveat here is that OCR and the OCR Cost Estimator can’t run simultaneously. So, if you’ve already enabled OCR and rely on it, make sure to stop the estimation process first.

Select More options here, and then click on Stop estimation.

You can always restart the estimation process; however, make sure to download the current report first.

When you start a new estimation, all existing data on the dashboards will be wiped out.

To download the current estimates, go to the Estimates dashboard and select Download Report to save the data in CSV format.

Here is an example of the CSV file:

OCR Limitations

Here are the OCR limitations you need to be aware of:

· The maximum supported image size is 50 MB.

· The minimum image dimensions are 50 x 50 pixels, and the maximum dimensions are 16K x 16K pixels.

· Zipped archives cannot be scanned.

· OCR cannot scan images embedded within Microsoft Word documents.
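
To get a feel for which images fall outside the size and dimension limits listed above, here is a small sketch. The function name Test-OcrImageLimits is hypothetical, and two readings are my assumptions: "50 MB" is treated as PowerShell's 50MB constant (52,428,800 bytes) and "16K" as 16,000 pixels. You supply the file size and dimensions yourself; the sketch does not read image files.

```powershell
# Hypothetical helper, not a Purview cmdlet.
function Test-OcrImageLimits {
    param(
        [long]$SizeBytes,
        [int]$WidthPx,
        [int]$HeightPx
    )

    $maxBytes = 50MB    # assumption: 50 MB read as 52,428,800 bytes
    $minDim   = 50      # minimum 50 x 50 pixels
    $maxDim   = 16000   # assumption: "16K" read as 16,000 pixels

    $reasons = @()
    if ($SizeBytes -gt $maxBytes) {
        $reasons += 'file is larger than 50 MB'
    }
    if ($WidthPx -lt $minDim -or $HeightPx -lt $minDim) {
        $reasons += 'image is smaller than 50 x 50 pixels'
    }
    if ($WidthPx -gt $maxDim -or $HeightPx -gt $maxDim) {
        $reasons += 'image is larger than 16K x 16K pixels'
    }

    [pscustomobject]@{
        WithinLimits = ($reasons.Count -eq 0)
        Reasons      = $reasons
    }
}

# A typical 1 MB, 800 x 600 screenshot is within the limits.
Test-OcrImageLimits -SizeBytes 1MB -WidthPx 800 -HeightPx 600

# A 40 x 40 icon is too small to be scanned.
Test-OcrImageLimits -SizeBytes 2KB -WidthPx 40 -HeightPx 40
```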

Some of you might be wondering: after enabling OCR, do you need to modify your existing DLP Policies? The answer is no: existing DLP Policies will automatically start scanning images.

OCR PowerShell Commands

There are OCR PowerShell cmdlets available, but at this time, Microsoft doesn’t offer any documentation on their usage—I’m sure it’s coming. In the meantime, I’ll show you a trick to find PowerShell cmdlets for newly released features or any features, for that matter.

Connect to MS Purview:

Import-Module ExchangeOnlineManagement

Connect-IPPSSession

Then, try searching for anything related to OCR, like this:

Get-Command | Where-Object {$_.Name -like "*OCR*"}

You’ll get some unrelated commands that include “OCR” in their names, but you’ll also see the following relevant ones:

Fetch current OCR Configuration/Settings:

Get-OcrConfiguration

Create new OCR Configuration and configure settings:

New-OcrConfiguration

Remove current OCR Configuration:

Remove-OcrConfiguration

Modify an existing OCR Configuration:

Set-OcrConfiguration
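
Since there is no documentation yet, the same discovery trick can be pushed one step further to see which parameters these cmdlets accept. The sketch below uses only standard Get-Command features and assumes you are already connected with Connect-IPPSSession, so that the four cmdlets listed above are loaded in your session. The output depends on your tenant and module version, so I am not showing it here.

```powershell
# Run this after Connect-IPPSSession; otherwise the cmdlets are not loaded.

# List only the cmdlets whose noun is OcrConfiguration (no unrelated matches).
Get-Command -Noun OcrConfiguration

# List the parameter names that New-OcrConfiguration accepts.
(Get-Command New-OcrConfiguration).Parameters.Keys | Sort-Object

# Show the full syntax of Set-OcrConfiguration.
Get-Command Set-OcrConfiguration -Syntax
```

The parameter list also includes the common PowerShell parameters, such as Verbose and ErrorAction, so look past those for the OCR-specific ones.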

When creating an OCR configuration, you can enable OCR for specific locations or exclude certain locations. This is very important if you have a lot of images in your tenant, as OCR can quickly become expensive. While I showed you cost estimates for my demo tenant, it’s essentially empty. In a real organization, with many images, OCR bills can pile up fast.

For example, let’s say you have a SharePoint site where users store images of their pets or photos from corporate parties and social events. These could add up to gigabytes of images that you don’t want to scan. Who cares if someone exfiltrates a photo of your manager’s puppy?

Using either New-OcrConfiguration or Set-OcrConfiguration, you can specify Exchange, SharePoint, or OneDrive locations to include, or you can exclude specific locations using the Exception parameters, such as SharePointLocationException.

You can then extract the Locations or Exceptions using the following:

$arrayValues = (Get-Command Get-OcrConfiguration).Parameters.SharePointLocations

$arrayValues[0]

Note that you will get a null reference exception if you don’t have OCR configured yet.
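
One way to avoid that error is to check for an empty result before reading anything from it. This is a minimal sketch that assumes Get-OcrConfiguration returns nothing when OCR has not been configured. The property names are not documented, so it uses a wildcard instead of guessing them.

```powershell
# Run this after Connect-IPPSSession.
$config = Get-OcrConfiguration

if ($null -eq $config) {
    # Assumption: an unconfigured tenant returns nothing rather than an error.
    Write-Host 'No OCR configuration found in this tenant yet.'
}
else {
    # Show every property whose name contains "Location",
    # which should cover both the included locations and the exceptions.
    $config | Format-List -Property *Location*
}
```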

Now that you know how to configure OCR and control its costs, it’s time to configure and optimize your settings. Focus on scanning only the locations that matter and avoid wasting money on unnecessary scans.