Clean up Sitecore database and avoid corrupt published content

We’ve discovered a rare issue in Sitecore Publish Service (SPS) where it may publish incorrect content to some fields. Even though I think SPS does this wrong, the root cause was inconsistent data in the master database. It turned out such inconsistency exist in most databases, even in a clean Sitecore install.

The inconsistency

If you’ve spent some time digging around the Sitecore database, you probably know that items are represented in the Items table and the content of each item field is stored in three different tables: VersionedFields, UnversionedFields and SharedFields. This corresponds to the specified scope checkboxes of each field in its templates.

So a shared field will have its data in the SharedFields table and so on. But if there are data in the UnversionedFields table, connected to a shared field, we have an inconsistency. Usually this isn’t a major problem. The surplus data just sits there doing nothing. It won’t be used and no one sees it.

So why can such inconsistency happen in the first place? What first comes to mind is changing a field scope from let say shared to versioned, where there are already items with content, based on that template. Well, Sitecore already thought about this and cleans this up. However, this is done by the events raised during save. If a field item is deserialized, those events are not raised, and no cleanup will be performed. This can happen when you share serialized items within a team, using for example TDS or Unicorn, or when you’re installing update packages etc.

If you want to try this yourself, you can simply add a shared field to the “Sample Item” template in a clean Sitecore instance. Serialize the template to disk and then change the field scope to unversioned. Add some data to the field on the Home item. Then revert the field item by deserializing it again and add some other content to the field on the Home item. You’ll now find both the shared and unversioned content in the database.

Why care about this?

As I said previously, the surplus data won’t do anything other than consuming some bytes of disk space. Or so I thought. My great colleague John found that SPS may start using this data. It turned out that SPS (at least version 3.1.3) may load data from the wrong table. In the example above, SPS will pick up the unversioned field value from the master database and publish it to the shared table in the web database.

It gets even more strange. If the shared field is modified, SPS will publish the correct value. But if some other field on the item is modified, but the shared field in the example, is left untouched, the content from the unversioned table is published. Oops!

How to fix it

Sitecore has the good tool /sitecore/admin/DBCleanup.aspx that can fix some common errors in the database. But it doesn’t cleanup the surplus data described here. So I’ve written a short SQL script that will find such invalid content in the database. You can grab it here: https://gist.github.com/mikaelnet/a57faa9a7f9b35b9bba1118787b4266d

The snippet, as-is, will just list all invalid database rows. After verifying, you can replace “SELECT *” with “DELETE” in those statements to clean up those records.

At the time of writing this, I don’t know if this is something that will be fixed. I’ve reported it to Sitecore, but I have not asked for a hotfix to SPS. I see this as an edge case that is easily fixed with the script above. But I’ve suggested it to be fixed in future versions, as well as including the described field cleanup in future versions of DBCleanup.

One thought on “Clean up Sitecore database and avoid corrupt published content

  1. Pingback: Sitecore field data corruption – C# and stuff

Comments are closed.