Thoughts on indexing Sitecore content with Solr

Today, my colleagues stumbled upon some strange behavior in one of our projects. I won’t go into all the details, but it made me think about how Sitecore indexes fields and how that could be improved. We use Solr, so at this stage I don’t know if this applies to Lucene as well, but I guess it does.

We typically auto-generate our domain model from a TDS project using a T4 template. With that in place, we can easily map all Sitecore template fields to typed properties in our domain model, and we can also map those properties to the fields in a Solr index. This means we can use the same model with the ContentSearch API as well as with the traditional API. We can also lazy-load fields from the original item if they are not marked as stored in the index.

Part of today’s strange behavior was that html and rich text fields are not stored properly, at least not in my opinion. Looking at the Sitecore.ContentSearch.Solr.DefaultIndexConfiguration.config file, we see that text fields such as “html”, “rich text”, “single-line text”, etc. are all mapped to the “text” field type. Fields of that type are then typically configured with stemming and other text analyzers in Solr.

Further down in that config file, we have field readers, such as NumericFieldReader, DateFieldReader, etc., that convert the raw value of a Sitecore field into an index-suitable value. Here we find that “html” and “rich text” use a RichTextFieldReader. That little class extracts the plain text from a rich text field, and from then on the value is treated like any other text string.

I think that’s not a very good idea, for two reasons. First, the html stripper is really simple. It basically just removes <tags>, so for example the contents of a script block fall through into the indexed text. Second, since it’s treated as a text field, it’s stored in the index by default. But since the stored value has already been processed by the RichTextFieldReader, it’s just a complete waste of index space. It can’t be used for anything, since it’s no longer the original value.
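
To illustrate the first point, here is a minimal Python sketch of a stripper that only removes tags. It is not the RichTextFieldReader code, just a hypothetical stand-in with the same weakness; the function name and the sample markup are made up for the example.

```python
import re

# Hypothetical stand-in for a "remove anything that looks like a tag" stripper.
# This is NOT Sitecore's implementation, only an illustration of the weakness.
TAG_PATTERN = re.compile(r"<[^>]+>")


def naive_strip_tags(html: str) -> str:
    """Replace every <...> sequence with a space and collapse whitespace.

    Only the tags themselves are removed; whatever sits between them is kept,
    including the body of a script element.
    """
    text = TAG_PATTERN.sub(" ", html)
    return " ".join(text.split())


sample = '<p>Hello <b>world</b></p><script>var tracker = "abc";</script>'
stripped = naive_strip_tags(sample)
print(stripped)
```

The script tags are gone, but the JavaScript between them ends up next to the editorial text, which is exactly the kind of noise I don’t want in a search index.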

To solve this, we can map these fields to a separate dynamic field type in the Solr config that works similarly to the text field type but strips html on the Solr side, using a char filter such as solr.HTMLStripCharFilterFactory. That filter is more advanced than the RichTextFieldReader. By moving the html processing to the Solr server, we can send the original value from Sitecore using the DefaultFieldReader instead, so the stored value in the index is the correct, unprocessed one (or we can easily opt all those fields out of being stored). We also ease the load on the web server a little bit.
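
As a rough local approximation of the idea, the Python sketch below uses the standard library html.parser to drop the contents of script and style elements, and keeps the original markup next to the stripped text. It is not Solr’s implementation and it doesn’t talk to a Solr server; the class, function and dictionary keys are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping anything inside script or style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &amp;
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Chunks are joined with spaces, so inline tags act as word breaks.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


raw = '<p>Fish &amp; <b>chips</b></p><script>var tracker = "abc";</script>'

# The stored value stays untouched; only the text used for matching is stripped.
document = {"stored": raw, "indexed_text": strip_html(raw)}
print(document["indexed_text"])
```

The point of the sketch is the split between the two values: the original html is what gets stored, and stripping only affects the text that is analyzed for searching, which is how I would expect a char filter in the Solr analysis chain to behave.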