Sitecore indexes are very powerful for retrieving items fast, especially when the items are scattered across the content tree. There are, however, some pitfalls to be aware of. In this post I’ll describe how I reduced indexing time from around three hours down to two minutes for a specific scenario.
In previous posts, I’ve described the delays that occur between content publish and content indexing. There is a gap between the moment the publisher writes content to the web database and the moment that content is actually indexed, since indexing is performed by background jobs driven by the event queue. Large publish operations may also trigger a full index rebuild, so it can take quite a while before indexing is complete.
This delay caused a lot of trouble in a solution I’ve been working on. The multi-site solution contains news articles in many places in the content tree, so listing news articles by traversing the item tree was not an option. Content Search is a great fit for this kind of scenario. However, the solution is so large that a full index rebuild takes around three hours, even though it has been optimised as much as possible. Publishing, let’s say, a scheduled press release and then facing the risk of having to wait a couple of hours before it shows up on the site was obviously not acceptable.
Speeding up indexing
Rebuilding the web index will always take some time, even after optimising the index configuration as much as possible (see previous post, working with content search and Solr in Sitecore 9). I therefore wanted to try having a separate index that would only contain the needed items to generate those news lists on the site. Since indexes are rebuilt by background jobs in parallel, it doesn’t matter if the main web index takes several hours to build, as long as my new index builds a lot faster.
A built-in feature of the SitecoreItemCrawler is the ability to specify a root item from which it starts crawling. It’s also possible to specify multiple crawlers within the same index. This is useful when you, for example, want to index both the content tree and the media library tree, or otherwise have a limited set of roots. In my scenario, however, the news article items were spread out in various locations in a content tree containing more than 500,000 items, so specifying one or a few roots wouldn’t do the job.
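For reference, a limited set of roots is normally configured by adding multiple crawlers to the index definition, along these lines (a simplified fragment; the index name is made up, and the surrounding index attributes and Solr settings are omitted):

```xml
<index id="my_custom_index">
  <locations hint="list:AddCrawler">
    <!-- One crawler per root; each crawler traverses its own subtree -->
    <crawler type="Sitecore.ContentSearch.SitecoreItemCrawler, Sitecore.ContentSearch">
      <Database>web</Database>
      <Root>/sitecore/content</Root>
    </crawler>
    <crawler type="Sitecore.ContentSearch.SitecoreItemCrawler, Sitecore.ContentSearch">
      <Database>web</Database>
      <Root>/sitecore/media library</Root>
    </crawler>
  </locations>
</index>
```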
I realised that if I could make the index contain only the valid item types, indexing would be a bit faster. I also realised that some subtrees don’t need crawling at all. In my scenario, for example, I know that no indexable news articles will be descendants of product sections, nor will there be such content beneath certain asset folders and so on.
So I decided to create a very simple filtered version of the SitecoreItemCrawler. My filtered crawler has the following configuration additions:
IncludeTemplateList configures which items will be processed and added to the index, based on their templates (or base templates). The crawler will still load non-matching items from the database, but it won’t process them or send anything to Solr, so some work is saved there.
StopTemplateList tells the crawler to ignore the configured branches. That is, when the crawler runs into an item based on one of the specified templates (or base templates), it will still process that item (if it matches the IncludeTemplateList), but it won’t continue traversing its children or descendants. Thereby, large portions of an item tree can be skipped entirely.
StopItemList works exactly like StopTemplateList, except that it specifies individual items rather than any item based on given templates. The StopItemList can contain either specific item IDs or generic item names. Specifying item names can be useful when there are multiple items sharing the same name.
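Wired together, the crawler configuration could look something like the sketch below. The three property names come from the description above, but the type name, the `hint` method names, the element names and the GUIDs are all placeholders of my own; see the gist for the actual configuration.

```xml
<crawler type="MySite.Search.FilteredItemCrawler, MySite.Search">
  <Database>web</Database>
  <Root>/sitecore/content</Root>
  <!-- Only items based on these templates (or base templates) are indexed -->
  <IncludeTemplateList hint="list:AddIncludeTemplate">
    <newsArticle>{11111111-1111-1111-1111-111111111111}</newsArticle>
  </IncludeTemplateList>
  <!-- Don't traverse below items based on these templates -->
  <StopTemplateList hint="list:AddStopTemplate">
    <productSection>{22222222-2222-2222-2222-222222222222}</productSection>
  </StopTemplateList>
  <!-- Don't traverse below these items, by ID or by generic item name -->
  <StopItemList hint="list:AddStopItem">
    <assets>Assets</assets>
  </StopItemList>
</crawler>
```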
With this item crawler, I managed to build my new index in just two minutes. Thereby I can ensure that news articles, press releases and so on show up very soon after a publish. The standard web index still takes a long time to build, of course, but that is fine as long as it’s only used where a long delay is acceptable, such as for driving the sitemap.xml.
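To give an idea of how little code this takes, here is an illustrative sketch of the include-filter part. Treat it as pseudocode: the override name and signature are assumptions that vary between Sitecore versions, the base-template matching and the plumbing that populates the lists from config are left out, and the gist below contains the working code.

```csharp
// Illustrative sketch only, not drop-in code.
using System.Collections.Generic;
using Sitecore.ContentSearch;
using Sitecore.Data;

public class FilteredItemCrawler : SitecoreItemCrawler
{
    // Populated from the IncludeTemplateList / StopTemplateList config
    private readonly HashSet<ID> includeTemplates = new HashSet<ID>();
    private readonly HashSet<ID> stopTemplates = new HashSet<ID>();

    // Items whose template isn't in the include list are still loaded,
    // but are never processed or sent to Solr.
    protected override bool IsExcludedFromIndex(SitecoreIndexableItem indexable, bool checkLocation)
    {
        if (!includeTemplates.Contains(indexable.Item.TemplateID))
            return true;
        return base.IsExcludedFromIndex(indexable, checkLocation);
    }

    // Stopping traversal beneath stop templates/items is done by
    // short-circuiting the crawl of an item's descendants in a
    // similar override (omitted here).
}
```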
You can grab the sample code here, including some config examples: https://gist.github.com/mikaelnet/0413e9c8ca7df15f8073d6ec3ed3191a