Indexing and OCR scanning PDF documents in Sitecore

PDF documents in the Sitecore media library can be indexed using IFilters, but that approach has its limitations, such as lack of Azure support, and isn’t very effective from a performance point of view. The way the extracted content is indexed also makes it harder to use in multi-language solutions.

I’ve taken a different approach to indexing PDF documents, making it more accurate and improving performance at the same time. The IFilter approach is generic and supports multiple file formats. I’ve focused on PDF documents in this post, as it’s a common format, but similar principles can be applied to other file formats as well.

In this post:

  • Avoiding heavy computation during index time
  • Extracting document content through PDF libraries
  • OCR scanning of image/non-text based PDF documents
  • Indexing documents with language stemming

Avoiding heavy computation

By default, Sitecore extracts content from files during index time. This is quite a heavy process, where the whole binary document needs to be loaded from the database and parsed, and its text content extracted as part of a computed field. This is done for every version, in every language, every time a PDF item is indexed. And the result will always be the same, so why do it over and over again?

The only time content extraction is really needed is when a PDF is uploaded or when a new binary file is attached to an existing item (i.e. the file is replaced). We can add a new field to media files that can hold the extracted content, thereby avoiding parsing the file every time.

Add a shared Multi-Line Text field, such as DocumentContent, to the unversioned File template ({962B53C4-F93B-4DF9-9821-415C867B8903}). As I’m not using versioned files myself, I haven’t included support for those here, but it can easily be added by adding the same field as a versioned field to the versioned File template ({611933AC-CE0C-4DDC-9683-F830232DB150}) and adjusting the code below accordingly. I’d also recommend limiting editor access to this field by denying Field Read and Field Write to Everyone (ar|sitecore\Everyone|pe|-field:write|-field:read|)

We can now hook into the two pipeline processors that are used when uploading/attaching new files, like this:

<?xml version="1.0" encoding="utf-8" ?>
<configuration xmlns:role="http://www.sitecore.net/xmlconfig/role/">
  <sitecore role:require="ContentManagement or Standalone">
    <processors>
      <attachFile>
        <processor mode="on" type="MyNamespace.AttachFileAddMetaData, MyAssembly"/>
      </attachFile>
      <uiUpload>
        <processor mode="on" type="MyNamespace.UploadFileAddMetaData, MyAssembly"/>
      </uiUpload>
    </processors>
  </sitecore>
</configuration>

Since the two pipelines take different arguments, we need two slightly different processors to deal with them:

namespace MyNamespace
{
    public class AttachFileAddMetaData
    {
        public void Process(AttachArgs args)
        {
            if (args?.MediaItem == null)
                return;

            var item = args.MediaItem.InnerItem;
            PdfHelper.LoadItemWithMetaData(item);
        }
    }

    public class UploadFileAddMetaData
    {
        public void Process(UploadArgs args)
        {
            if (args == null)
                return;

            foreach (var item in args.UploadedItems)
            {
                PdfHelper.LoadItemWithMetaData(item);
            }
        }
    }
}

Extracting PDF content

The next step is to actually extract the PDF content and store it in our new field. I’ve used iTextSharp in this example as it’s almost a one-liner, but other libraries can be used as well. I’m extracting both the body content and the PDF metadata, such as Title and Keywords. Please note that iTextSharp comes with a commercial license. Consider using PDFsharp instead, as it ships with the Sitecore product from version 9.1. PDFsharp needs a bit more code to extract the content from PDF documents, but Stack Overflow will show you how.

While experimenting with this, I found that the Sitecore Content Editor doesn’t like large text blocks. And to be fair, Sitecore can’t really be blamed for this. If the text body of a PDF is very large, such as a megabyte or more, extracting it into a Multi-Line Text field means it also has to be downloaded to the editing browser and rendered into an HTML form every time the item is opened. To avoid this, and also for content relevance reasons in my scenario, I decided to cut off very long documents at 64k characters.

An alternative way could be to use the Blob field type instead of a Multi-Line Text for the extracted content. That would require some adjustments to how the extracted content is stored on the item, but I’ve kept it simple here in favor of more readable code.
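If you go the Blob route, writing the extracted text could look roughly like this. This is a hedged sketch: the helper name is mine, and the SetBlobStream call should be verified against the Sitecore version in use.

```csharp
// Sketch: store the extracted text in a Blob field named "DocumentContent"
// instead of a Multi-Line Text field, avoiding pushing large text blocks
// to the Content Editor.
private static void WriteContentBlob(Item item, string content)
{
    var field = item.Fields["DocumentContent"];
    if (field == null || string.IsNullOrEmpty(content))
        return;

    item.Editing.BeginEdit();
    using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(content)))
    {
        // Stores the stream in the blob storage and writes the blob
        // reference into the field value
        field.SetBlobStream(stream);
    }
    item.Editing.EndEdit();
}
```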

namespace MyNamespace
{
    public static class PdfHelper
    {
        public static void LoadItemWithMetaData(Item item)
        {
            MediaItem mediaItem = item;
            var ext = mediaItem.Extension.ToLowerInvariant();
            if (ext != "pdf" && mediaItem.MimeType != "application/pdf")
                return;

            string title, keywords, subject, content = null;
            try
            {
                using (var reader = new PdfReader(mediaItem.GetMediaStream()))
                {
                    var info = reader.Info;
                    title = GetInfoValue(info, "Title");
                    keywords = GetInfoValue(info, "Keywords");
                    subject = GetInfoValue(info, "Subject");

                    if (item.Fields["DocumentContent"] != null)
                    {
                        var textWriter = new StringWriter();
                        var strategy = new SimpleTextExtractionStrategy();
                        int length = 0;
                        int cutOffLength = Settings.GetIntSetting("PdfTextCutOffLength", 64000);
                        for (int pageNumber = 1; pageNumber <= reader.NumberOfPages; pageNumber++)
                        {
                            var pageText = PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy);
                            ExtractText(new StringReader(pageText), textWriter, cutOffLength, ref length);
                            if (length >= cutOffLength)
                            {
                                Log.Info($"Breaking PDF content after {length} characters", nameof(PdfHelper));
                                break;
                            }
                        }
                        content = textWriter.ToString();
                    }
                }
            }
            catch (Exception ex)
            {
                Log.Error("Unable to load PDF data from " + item.Paths.Path, ex, nameof(PdfHelper));
                return;
            }

            // Write the shared content field
            MapMediaItemField(item, "DocumentContent", content);
            if (item.Editing.IsEditing)
            {
                item.Editing.EndEdit();
            }

            // Write all version fields (Adjust this for versioned files)
            foreach (var itemVersion in item.Versions.GetVersions(true))
            {
                MapMediaItemField(itemVersion, "Title", title);
                MapMediaItemField(itemVersion, "Keywords", keywords);
                MapMediaItemField(itemVersion, "Description", subject);

                if (itemVersion.Editing.IsEditing)
                {
                    itemVersion.Editing.EndEdit();
                }
            }
        }

        private static string GetInfoValue(IDictionary<string, string> info, string key)
        {
            if (info == null)
                return null;
            if (!info.ContainsKey(key))
                return null;
            var value = info[key];
            if (string.IsNullOrWhiteSpace(value))
                return null;
            return value.Trim();    // Ensure a new instance here as the PdfReader instance will go away
        }

        private static void MapMediaItemField(Item item, string fieldName, string value)
        {
            if (string.IsNullOrWhiteSpace(value))
                return;

            var field = item.Fields[fieldName];
            if (field == null)
                return;

            if (field.Value != value)
            {
                if (!item.Editing.IsEditing)
                    item.Editing.BeginEdit();
                field.Value = value;
            }
        }

        /// <summary>
        /// Strips out all delimiters and other odd characters, keeping only 
        /// the letters (including support for latin, greek,cyrillic, arabic, 
        /// japanese, chinese, korean character sets). Used for indexing etc
        /// where only the real words are important.
        /// </summary>
        /// <param name="reader"></param>
        /// <param name="writer"></param>
        /// <param name="cutOffLength"></param>
        /// <param name="length"></param>
        public static void ExtractText(TextReader reader, TextWriter writer, int cutOffLength, ref int length)
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                bool hasLineContent = false;
                bool isWhiteSpace = true;
                foreach (char c in line)
                {
                    // Accept all real letters and digits, including all languages.
                    // Ranges are picked from the Unicode Character Ranges specification
                    // Add more language ranges here as needed per solution.
                    // http://jrgraphix.net/research/unicode.php
                    if (c >= '0' && c <= '9' ||
                        c >= 'A' && c <= 'Z' ||
                        c >= 'a' && c <= 'z' ||
                        c >= 0x00A1 && c < 0x02B0 || // Extended latin
                        c >= 0x0370 && c < 0x2000 || // Greek, Cyrillic, Hebrew, Arabic, Syriac etc
                        c >= 0x2E80 && c < 0xE000 || // CJK, Japanese etc
                        c >= 0xF900 && c < 0xFF00 )  // specials
                    {
                        if (isWhiteSpace && hasLineContent)
                        {
                            writer.Write(' ');
                            length++;
                        }
                        isWhiteSpace = false;
                        writer.Write(c);
                        hasLineContent = true;
                        length++;
                    }
                    else
                    {
                        isWhiteSpace = true;
                    }

                    if (isWhiteSpace && length > cutOffLength)
                        return;
                }
                if (hasLineContent)
                    writer.WriteLine();
            }
        }
    }
}

The code is pretty straightforward, but the ExtractText method may need some additional comments. Some documents contain a lot of control characters, symbols etc. that an indexer, such as Solr, will discard anyway. So in that sense, removing them here isn’t needed. But why bother removing them here, when Solr will do it anyway, probably in a more efficient and accurate way? Well, there are many ways of storing/encoding PDF documents. I’ve found that some formats cannot be extracted into plain text and will instead return all kinds of strange characters. By eliminating all character sets that aren’t among the languages used by the solution, it becomes easier to identify such documents.
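As an illustration of that last point, a hypothetical helper (not part of the code above) could compare the filtered length to the raw extraction length and flag garbage output as OCR candidates:

```csharp
// Hypothetical heuristic: if most of the raw extraction was filtered away
// as non-letters, the PDF probably can't be parsed as plain text and is
// a candidate for OCR scanning instead.
public static bool LooksLikeGarbage(string rawText, string filteredText)
{
    if (string.IsNullOrEmpty(rawText))
        return true;
    var ratio = (double)filteredText.Length / rawText.Length;
    return ratio < 0.25; // Arbitrary threshold; tune per solution
}
```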

OCR scanning

When PDF documents are encoded in a format that cannot be extracted into plain text, or when documents are simply scanned documents embedded as images, we can’t use normal document parsers. So we need a different approach for such documents. I thought playing around a bit with OCR (Optical Character Recognition) would be a good solution to this.

Amazon recently announced its Textract OCR Cloud Service. It can scan images and PDF documents and extract text content as well as table and form data. The Textract service is quite cheap too, at just $0.0015 per page (not per document!). However, analyzing more advanced table and form documents is more expensive.

Note: When writing this, Amazon Textract only supports English text, so it might not be usable in all scenarios.

The Amazon Textract API returns a lot of metadata around the scanned text, such as its position and a confidence score indicating how likely the text was interpreted correctly. One can dig very deep into this, but I’ve kept this example very simple. Asynchronously parsing multi-page PDF documents, involving Amazon S3 storage, SQS queues and SNS topics, is complex enough.

Here’s a sample class that OCR scans a PDF media item and can be triggered from a ribbon button in the Sitecore Content Editor.

namespace MyNamespace
{
    /// <summary>
    /// Uses Amazon Textract to OCR scan PDF documents in media library.
    /// In order for the function to work, a configured Amazon account
    /// is needed.
    /// A summary of how to configure Amazon Textract for Asynchronous
    /// Operations is available here:
    /// https://docs.aws.amazon.com/textract/latest/dg/api-async-roles.html
    ///
    /// </summary>
    public class OcrScanPdfCommand : Command
    {
        private const string ConnectionStringName = "AmazonTextract";
        private const string SettingsPrefix = "OcrScanner";

        public string BucketName { get; protected set; }
        public string BucketKeyPrefix { get; protected set; }
        public RegionEndpoint Region { get; protected set; }
        protected AWSCredentials Credentials { private get; set; }
        public string SnsTopicArn { get; protected set; }
        public string SnsRoleArn { get; protected set; }
        public TimeSpan Timeout { get; protected set; } = TimeSpan.FromSeconds(120);
        public float MinConfidence { get; protected set; } = 50.0f;

        private bool _configurationRead = false;

        private string GetConfigurationParameter(IDictionary<string, string> config, string key)
        {
            return Settings.GetSetting($"{SettingsPrefix}.{key}", config.ContainsKey(key) ? config[key] : null);
        }

        public virtual void ReadConfiguration()
        {
            if (_configurationRead)
                return;

            if (!Settings.ConnectionStringExists(ConnectionStringName))
            {
                Log.SingleError($"Connection string {ConnectionStringName} not configured", nameof(OcrScanPdfCommand));
                _configurationRead = true;
                return;
            }
            var connectionString = Settings.GetConnectionString(ConnectionStringName);
            
            var config = connectionString.Split(';', '&')
                .Select(p =>
                {
                    var param = p.Split('=');
                    return new Tuple<string, string>(param[0], param[1]);
                })
                .ToDictionary(d => d.Item1, d => d.Item2, StringComparer.InvariantCultureIgnoreCase);

            BucketName = GetConfigurationParameter(config, nameof(BucketName));
            BucketKeyPrefix = GetConfigurationParameter(config, nameof(BucketKeyPrefix));
            SnsTopicArn = GetConfigurationParameter(config, nameof(SnsTopicArn));
            SnsRoleArn = GetConfigurationParameter(config, nameof(SnsRoleArn));
            MinConfidence = (float) Settings.GetDoubleSetting($"{SettingsPrefix}.{nameof(MinConfidence)}", 50);
            var timeout = GetConfigurationParameter(config, nameof(Timeout));
            if (!string.IsNullOrWhiteSpace(timeout))
                Timeout = TimeSpan.FromSeconds(int.Parse(timeout));

            Region = RegionEndpoint.GetBySystemName(GetConfigurationParameter(config, nameof(Region)));

            if (config.ContainsKey("AccessKey") && config.ContainsKey("SecretKey"))
            {
                // Credentials from ConnectionStrings.config
                Credentials = new BasicAWSCredentials(config["AccessKey"], config["SecretKey"]);
            }
            else if (!string.IsNullOrWhiteSpace(ConfigurationManager.AppSettings["AWSAccessKey"]))
            {
                // Credentials from web.config app settings
                Credentials = new AppConfigAWSCredentials();
            }
            else
            {
                // Credentials from EC2 machine role
                Credentials = new InstanceProfileAWSCredentials();
            }
        }

        public bool IsConfigured
        {
            get
            {
                ReadConfiguration();
                return !string.IsNullOrEmpty(BucketName) &&
                       !string.IsNullOrEmpty(SnsTopicArn) &&
                       !string.IsNullOrEmpty(SnsRoleArn);
            }
        }
        

        public override CommandState QueryState(CommandContext context)
        {
            if (context.Items == null || context.Items.Length != 1)
                return CommandState.Hidden;

            var item = context.Items[0];
            if (item == null)
                return CommandState.Hidden;

            if (!item.Paths.IsMediaItem)
                return CommandState.Hidden;

            MediaItem mediaItem = item;
            var ext = mediaItem.Extension.ToLowerInvariant();
            if (ext != "pdf" && mediaItem.MimeType != "application/pdf")
                return CommandState.Disabled;

            if (!IsConfigured)
                return CommandState.Hidden;

            return CommandState.Enabled;
        }

        public override void Execute(CommandContext context)
        {
            Assert.ArgumentNotNull(context, nameof(context));
            var item = context.Items[0];

            var parameters = new NameValueCollection();
            parameters["id"] = item.ID.ToString();
            parameters["lang"] = item.Language.Name;
            parameters["db"] = item.Database.Name;

            Sitecore.Context.ClientPage.Start(this, nameof(Run), parameters);
        }

        protected virtual void Run(ClientPipelineArgs args)
        {
            Assert.ArgumentNotNull(args, nameof(args));
            var itemId = new ID(args.Parameters["id"]);
            var db = Factory.GetDatabase(args.Parameters["db"]);
            var lang = LanguageManager.GetLanguage(args.Parameters["lang"], db);

            var item = db.GetItem(itemId, lang);
            if (item == null)
            {
                SheerResponse.Alert("Item not found");
                return;
            }

            MediaItem mediaItem = item;
            var ext = mediaItem.Extension.ToLowerInvariant();
            if (ext != "pdf" && mediaItem.MimeType != "application/pdf")
            {
                SheerResponse.Alert("Only PDF items are currently supported");
                return;
            }

            Sitecore.Shell.Applications.Dialogs.ProgressBoxes.ProgressBox
                .Execute("OCR Scan PDF", "OCR Scanning PDF content",
                    StartProcess, new object[] { mediaItem });
        }

        public void StartProcess(params object[] parameters)
        {
            var mediaItem = (MediaItem)parameters[0];

            Log.Info($"Amazon Textract OCR scanning of {mediaItem.MediaPath}.{mediaItem.Extension} ({mediaItem.ID})", nameof(OcrScanPdfCommand));

            try
            {
                var progressStatus = Context.Job.Status;
                progressStatus.Total = (long) (10 + Timeout.TotalSeconds);
                progressStatus.Processed = 0;

                progressStatus.Messages.Add("Sending media to Amazon Textract");
                var s3Object = UploadMediaItem(mediaItem);
                progressStatus.Processed = 4;

                progressStatus.Messages.Add("Analyzing document...");
                var amazonTextractClient = new AmazonTextractClient(Credentials, Region);
                var jobId = StartDocumentTextDetection(amazonTextractClient, s3Object);
                progressStatus.Processed += 2;

                var result = WaitForTextDetectionResult(amazonTextractClient, jobId);
                if (!string.IsNullOrWhiteSpace(result))
                {
                    progressStatus.Processed = progressStatus.Total - 2;
                    progressStatus.Messages.Add("Saving result");

                    mediaItem.InnerItem.Editing.BeginEdit();
                    mediaItem.InnerItem["DocumentContent"] = result;
                    mediaItem.InnerItem.Editing.EndEdit();
                }
                progressStatus.Processed = progressStatus.Total;
            }
            catch (Exception ex)
            {
                Log.Error($"OCR scanning of {mediaItem.MediaPath}", ex, nameof(OcrScanPdfCommand));
                SheerResponse.Alert("An error occurred when scanning the document.");
            }
        }

        /// <summary>
        /// Waits for a content detection job to complete and returns the parsed content.
        /// </summary>
        /// <param name="client"></param>
        /// <param name="jobId"></param>
        /// <returns></returns>
        private string WaitForTextDetectionResult(AmazonTextractClient client, string jobId)
        {
            var request = new GetDocumentTextDetectionRequest {JobId = jobId};
            var response = client.GetDocumentTextDetection(request);
            var start = DateTime.UtcNow;
            while (response.JobStatus == JobStatus.IN_PROGRESS && DateTime.UtcNow - start < Timeout)
            {
                Thread.Sleep(TimeSpan.FromSeconds(1));
                response = client.GetDocumentTextDetection(request);
                Sitecore.Context.Job.Status.Processed++;
                Sitecore.Context.Job.Status.Messages.Add($"Analyzing document... {DateTime.UtcNow - start:mm\\:ss}");
            }

            if (response.JobStatus == JobStatus.IN_PROGRESS)
            {
                Log.Error($"Amazon Textract OCR scanning timed out. JobID: {jobId}", nameof(OcrScanPdfCommand));
                return null;
            }

            if (response.JobStatus == JobStatus.FAILED)
            {
                Log.Error($"Amazon Textract OCR scanning failed. JobID: {jobId}", nameof(OcrScanPdfCommand));
                return null;
            }

            var sb = new StringBuilder();
            sb.Append(ExtractContent(response));

            while (!string.IsNullOrEmpty(response.NextToken))
            {
                request = new GetDocumentTextDetectionRequest {JobId = jobId, NextToken = response.NextToken};
                response = client.GetDocumentTextDetection(request);

                sb.Append(ExtractContent(response));
            }

            return sb.ToString();
        }

        /// <summary>
        /// Collects text content from the response having the
        /// configured confidence. A lot more logic may go into this method
        /// </summary>
        /// <param name="response"></param>
        /// <returns></returns>
        protected virtual string ExtractContent(GetDocumentTextDetectionResponse response)
        {
            var sb = new StringBuilder();

            foreach (var block in response.Blocks)
            {
                if (!string.IsNullOrWhiteSpace(block.Text) && block.Confidence > MinConfidence)
                {
                    sb.AppendLine(block.Text);
                }
            }

            return sb.ToString();
        }

        /// <summary>
        /// Start Textract text detection job.
        /// Note: Calls to this method costs $0.0015 per page
        /// </summary>
        /// <param name="client"></param>
        /// <param name="s3Object"></param>
        /// <returns></returns>
        private string StartDocumentTextDetection(AmazonTextractClient client, S3ObjectVersion s3Object)
        {
            var request = new StartDocumentTextDetectionRequest
            {
                DocumentLocation = ConvertToDocumentLocation(s3Object),
                NotificationChannel = new NotificationChannel { SNSTopicArn = SnsTopicArn, RoleArn = SnsRoleArn }
            };
            var response = client.StartDocumentTextDetection(request);
            return response.JobId;
        }

        private DocumentLocation ConvertToDocumentLocation (S3ObjectVersion s3Object)
        {
            return new DocumentLocation
            {
                S3Object = new Amazon.Textract.Model.S3Object
                {
                    Bucket = s3Object.BucketName,
                    Name = s3Object.Key,
                    Version = s3Object.VersionId
                }
            };
        }

        /// <summary>
        /// Uploads a given media item to a S3 bucket.
        /// This is temporary storage during the Textract operation.
        /// Consider adding an object lifetime rule to the bucket to clean it up
        /// </summary>
        /// <param name="item">The media item to upload</param>
        /// <returns>The S3Object representation of the stored object</returns>
        protected virtual S3ObjectVersion UploadMediaItem(MediaItem item)
        {
            using (var memoryStream = new MemoryStream())
            {
                item.GetMediaStream().CopyTo(memoryStream);
                memoryStream.Seek(0, SeekOrigin.Begin);

                var key = $"{BucketKeyPrefix}{item.ID.Guid:D}.{item.Extension}";
                var client = new AmazonS3Client(Credentials, Region);
                var request = new PutObjectRequest
                {
                    BucketName = BucketName,
                    Key = key,
                    InputStream = memoryStream,
                    AutoCloseStream = false,
                    AutoResetStreamPosition = true,
                };
                var response = client.PutObject(request);
                return new S3ObjectVersion()
                {
                    BucketName = BucketName,
                    ETag = response.ETag,
                    Key = key,
                    VersionId = response.VersionId,
                };
            }
        }
    }
}

The service also involves configuring Amazon AWS credentials, as well as some other settings. Settings can be embedded in the ConnectionStrings.config file or added as Sitecore settings. The format of the connection string is like this:

<add name="AmazonTextract" connectionString="BucketName=xxx;BucketKeyPrefix=xxx;SnsTopicArn=xxx;SnsRoleArn=xxx;Timeout=nnn;Region=xxx;AccessKey=***;SecretKey=***" />

It’s typically a good idea to put a lifetime rule on the S3 bucket, so that processed files are deleted from the bucket.
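A lifetime rule is typically a one-time setup done in the AWS console or CLI, but as a sketch it could also be applied from code using the AWS SDK for .NET. Verify the lifecycle types against the SDK version in use; the rule id and one-day expiry below are just examples.

```csharp
// Sketch: expire objects under the configured key prefix after one day,
// so uploaded PDFs don't linger in the bucket after Textract is done.
var client = new AmazonS3Client(Credentials, Region);
client.PutLifecycleConfiguration(new PutLifecycleConfigurationRequest
{
    BucketName = BucketName,
    Configuration = new LifecycleConfiguration
    {
        Rules = new List<LifecycleRule>
        {
            new LifecycleRule
            {
                Id = "expire-textract-uploads", // example rule name
                Status = LifecycleRuleStatus.Enabled,
                Filter = new LifecycleFilter
                {
                    LifecycleFilterPredicate = new LifecyclePrefixPredicate { Prefix = BucketKeyPrefix }
                },
                Expiration = new LifecycleRuleExpiration { Days = 1 }
            }
        }
    }
});
```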

Indexing the content

As mentioned in the beginning of this post, IFilter processing is quite heavy. Therefore I typically remove the PDF processor, and I usually remove the _content computed field as well, as I don’t see much use for it in multi-language solutions. This is essentially just a performance improvement, so keep it if you need it:

<?xml version="1.0" encoding="utf-8" ?>
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/" xmlns:search="http://www.sitecore.net/xmlconfig/search/">
  <sitecore search:require="Solr">
    <contentSearch>
      <indexConfigurations>
        <defaultSolrIndexConfiguration>
          <documentOptions>
            <fields hint="raw:AddComputedIndexField">
              <field fieldName="_content">
                <patch:delete />
              </field>
            </fields>
          </documentOptions>

          <mediaIndexing>
            <mimeTypes>
              <includes>
                <mimeType>application/pdf<patch:delete /></mimeType>
              </includes>
            </mimeTypes>
          </mediaIndexing>
        </defaultSolrIndexConfiguration>
      </indexConfigurations>
    </contentSearch>
  </sitecore>
</configuration>

As the PDF content is now extracted into a regular Multi-Line Text field, it will also be indexed and stemmed using the current language. However, there’s no universal way of knowing what language a document is written in. Sometimes multiple languages are even embedded in the same document, such as in a product manual.

As with any other item, Sitecore will index the PDF media item using the language versions the item has. For an unversioned File item, this means the shared DocumentContent field is indexed in all languages. In a multi-language solution, one needs to define a set of rules for which languages should be used for each document. This could be done by document structure, naming conventions etc. The whole point of this is that it won’t be using the built-in _content field, which is typically just a single text_general field.

If DocumentContent is defined as a Blob field instead, one needs to add a new computed index field in order to load the content from the Blob and return it as a string at index time.
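Such a computed field could be sketched like this. The class name and the Blob-based DocumentContent field are assumptions from this post; the field would also need to be registered under AddComputedIndexField in the index configuration.

```csharp
// Sketch: computed index field that reads a Blob-based DocumentContent
// field and returns its text at index time.
public class DocumentContentField : IComputedIndexField
{
    public string FieldName { get; set; }
    public string ReturnType { get; set; }

    public object ComputeFieldValue(IIndexable indexable)
    {
        var item = (indexable as SitecoreIndexableItem)?.Item;
        var field = item?.Fields["DocumentContent"];
        if (field == null)
            return null;

        using (var stream = field.GetBlobStream())
        {
            if (stream == null)
                return null;
            using (var reader = new StreamReader(stream, Encoding.UTF8))
                return reader.ReadToEnd();
        }
    }
}
```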
