Sitecore indexes are very powerful for retrieving items quickly, especially when the items are spread across many places in the content tree. There are also some pitfalls to be aware of. In this post I’ll describe how I reduced indexing time from around three hours down to two minutes for a specific scenario, using an approach that may be worth considering in similar situations.
The Scenario
In previous posts, I’ve described the delay that occurs between content publish and content indexing. There is a gap between the moment the publisher writes content to the web database and the moment that content is indexed, because indexing is performed by background jobs driven by the event queue. Large publish operations may also trigger a full index rebuild, so it can take some time until indexing is complete.
This delay caused a lot of trouble in a solution I’ve been working on. The multi-site solution contained news articles in many places in the content tree, so listing news articles by traversing the item tree was not an option. Using Content Search is a great solution in this kind of scenario. However, the entire solution is so large that a full index rebuild takes around three hours, even though it has been optimised as much as possible. Publishing something like a scheduled press release and then risking a wait of a couple of hours before it shows up on the site was obviously not acceptable.
Speeding up indexing
Rebuilding the web index will always take some time, even after optimising the index configuration as much as possible (see previous post, working with content search and Solr in Sitecore 9). I therefore wanted to try a separate index containing only the items needed to generate those news lists on the site. Since indexes are rebuilt by background jobs in parallel, it doesn’t matter if the main web index takes several hours to build, as long as my new index builds a lot faster.
A built-in feature of the SitecoreItemCrawler is its ability to specify a root folder from where it should start crawling. It’s also possible to specify multiple crawlers within the same index. This is useful when you, for example, want to index the content tree and the media library tree, or in some other way have a limited set of roots. However, in my scenario, the news article items were spread out in various locations in a content tree containing more than 500,000 items. So just specifying one or a few roots wouldn’t do the job.
I realised that if I can make the index contain only the relevant item types, indexing will be somewhat faster. I also realised that some subtrees don’t need crawling at all. For example, in my scenario I know that no indexable news articles will be descendants of product sections. Neither will there be such content beneath certain asset folders, and so on.
So I decided to create a very simple filtered version of the SitecoreItemCrawler. My filtered crawler version has the following configuration additions:
- The IncludeTemplate list configures which items are processed and added to the index, based on their templates (or base templates). The crawler still loads non-matching items from the database, but it doesn’t process them or send anything to Solr, so some work is saved there.
- The StopTemplate list tells the crawler to ignore the configured branches. That is, when the crawler runs into an item based on one of the specified templates (or base templates), it processes that item (if its template is in the IncludeTemplate list) but doesn’t continue traversing its children/descendants. Large portions of an item tree can thereby be skipped entirely by the crawler.
- The StopItem list works exactly like StopTemplate, except that it specifies individual items rather than any item based on given templates. The StopItem list can contain either specific item IDs or generic item names. Specifying item names can be useful when multiple items share the same name.
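To make the interplay between the three lists concrete, here is a minimal sketch of the traversal logic. It is not the crawler from the gist and doesn’t use the Sitecore API: Node and FilteredCrawlSketch are hypothetical stand-in types, each node carries its own template and base templates as a flattened set of names, and the stop item list is matched by name only.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for a Sitecore item; not the real Item type.
public class Node
{
    public string Name { get; }
    // Own template plus all base templates, flattened for simplicity.
    public HashSet<string> Templates { get; }
    public List<Node> Children { get; } = new List<Node>();

    public Node(string name, params string[] templates)
    {
        Name = name;
        Templates = new HashSet<string>(templates);
    }

    public Node Add(params Node[] children)
    {
        Children.AddRange(children);
        return this;
    }
}

// Hypothetical model of the filtering rules; names mirror the config lists.
public class FilteredCrawlSketch
{
    private readonly HashSet<string> _includeTemplates;
    private readonly HashSet<string> _stopTemplates;
    private readonly HashSet<string> _stopItems;

    public FilteredCrawlSketch(
        IEnumerable<string> includeTemplates,
        IEnumerable<string> stopTemplates,
        IEnumerable<string> stopItems)
    {
        _includeTemplates = new HashSet<string>(includeTemplates);
        _stopTemplates = new HashSet<string>(stopTemplates);
        _stopItems = new HashSet<string>(stopItems);
    }

    // Returns the names of the items that would be sent to the index.
    public List<string> Crawl(Node root)
    {
        var indexed = new List<string>();
        Visit(root, indexed);
        return indexed;
    }

    private void Visit(Node item, List<string> indexed)
    {
        // IncludeTemplate: only matching items are processed and indexed.
        if (item.Templates.Overlaps(_includeTemplates))
            indexed.Add(item.Name);

        // StopTemplate / StopItem: the item itself was handled above,
        // but none of its descendants are visited.
        if (item.Templates.Overlaps(_stopTemplates) || _stopItems.Contains(item.Name))
            return;

        foreach (var child in item.Children)
            Visit(child, indexed);
    }
}
```

With include template "NewsArticle", stop template "ProductSection" and stop item "Assets", a news article below a product section or below an item named Assets is never visited, while news articles elsewhere in the tree end up in the result.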
With this item crawler, I managed to build my new index in just two minutes. That way I can ensure news articles, press releases etc. show up very soon after a publish. The standard web index still takes a long time to build of course, but that is OK as long as it’s only used where a long delay is acceptable, such as driving the sitemap.xml.
You can grab the sample code here, including some config examples: https://gist.github.com/mikaelnet/0413e9c8ca7df15f8073d6ec3ed3191a
Hello, thank you for the solution. I have a question about the line of code below. Could you please explain what it does? What does InheritsTemplate do? Is it some kind of extension method that you have used?
if (_stopTemplates.Any(st => item.Template.InheritsTemplate(st)))
Thanks! Yes, it’s an extension method that is identical to Item.DescendsFrom that was introduced in Sitecore 9.1. I’ll update the post with it.
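For readers on older versions, a check of this kind could look roughly like the sketch below. It is an illustration only, not the code from the gist: TemplateSketch is a hypothetical stand-in for a Sitecore template, and the sketch assumes the check should return true both for the template itself and for anything that inherits from it, directly or indirectly.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a Sitecore template; not the real TemplateItem type.
public class TemplateSketch
{
    public string Id { get; }
    public List<TemplateSketch> BaseTemplates { get; } = new List<TemplateSketch>();

    public TemplateSketch(string id, params TemplateSketch[] baseTemplates)
    {
        Id = id;
        BaseTemplates.AddRange(baseTemplates);
    }
}

public static class TemplateSketchExtensions
{
    // True when the template is the given one or inherits from it,
    // directly or through any chain of base templates.
    public static bool InheritsTemplate(this TemplateSketch template, string templateId)
    {
        return InheritsTemplate(template, templateId, new HashSet<string>());
    }

    private static bool InheritsTemplate(
        TemplateSketch template, string templateId, HashSet<string> visited)
    {
        // The visited set avoids re-walking shared base templates
        // and guards against circular inheritance.
        if (template == null || !visited.Add(template.Id))
            return false;
        if (template.Id == templateId)
            return true;
        return template.BaseTemplates.Any(b => InheritsTemplate(b, templateId, visited));
    }
}
```

The recursive walk over base templates is what makes this kind of check relatively costly compared to a plain ID comparison.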
Thank you Mikael for the quick response. This really helped me with one of the implementations that I am working on in Sitecore 8.2. One last question, just for my own check: should the _excludedTemplates that you mention also be part of the index definition? This is not the same as what we define in the indexconfiguration -> documentoptions, correct?
Hi,
The _excludedTemplates is just an internal representation that lets the crawler learn what templates to exclude. This is because the Item.DescendsFrom() call can be a bit costly (in comparison). So if the crawler has once found that a template doesn’t need to be indexed, it remembers that ID for future processing.
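As a rough illustration of that idea (not the actual crawler code), the sketch below wraps a costly check in a small cache of rejected template IDs. ExcludedTemplateCache, ShouldIndex and CostlyChecks are hypothetical names, and the costly check is passed in as a delegate so that the example stays independent of the Sitecore API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper that remembers which template IDs were already rejected.
public class ExcludedTemplateCache
{
    private readonly Func<Guid, bool> _costlyIncludeCheck;
    private readonly HashSet<Guid> _excludedTemplates = new HashSet<Guid>();

    // Number of times the costly check actually ran; for illustration only.
    public int CostlyChecks { get; private set; }

    public ExcludedTemplateCache(Func<Guid, bool> costlyIncludeCheck)
    {
        _costlyIncludeCheck = costlyIncludeCheck;
    }

    public bool ShouldIndex(Guid templateId)
    {
        // Cheap lookup first: a template rejected once stays rejected.
        if (_excludedTemplates.Contains(templateId))
            return false;

        CostlyChecks++;
        if (_costlyIncludeCheck(templateId))
            return true;

        // Remember the rejected template so that the costly check is skipped next time.
        // Note: HashSet is not thread-safe; code running in parallel would need
        // locking or a concurrent collection.
        _excludedTemplates.Add(templateId);
        return false;
    }
}
```

Only rejections are cached here, matching the description above. Whether caching the accepted templates as well pays off depends on how many items use included templates.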
Thank you so much for the detail again, very helpful.