Sitecore Remote Events Stopped Working

A colleague of mine spent several hours tracking down a very nasty problem in one of our Sitecore environments. Thomas did a great job on this, so all credit goes to him. I thought we should share the finding with the community as well.

The problem we were facing was that our Content Delivery servers stopped processing remote events, such as HTML cache clearing, index updates, etc. This seems to be quite a common problem, usually caused by incorrect server configuration, and it is easy to find when the error is consistent. Our problem was a bit different, but I’ll get to that later on.

A common configuration error is getting the instance names wrong. The InstanceName setting is empty by default, meaning that Sitecore will combine the server name and the IIS website name. This is fine for most scenarios, but your publishing instance, typically the Content Management server, needs a unique name that all other servers can refer to. I typically create two config files, like this:

An InstanceName.config file goes onto the publishing instance only:

<?xml version="1.0" encoding="utf-8"?>
<configuration xmlns:set="http://www.sitecore.net/xmlconfig/set/">
  <sitecore>
    <settings>
      <setting name="InstanceName" set:value="CM" />
    </settings>
  </sitecore>
</configuration>

A PublishingInstance.config file goes onto all the servers:

<?xml version="1.0" encoding="utf-8"?>
<configuration xmlns:set="http://www.sitecore.net/xmlconfig/set/">
  <sitecore>
    <settings>
      <setting name="Publishing.PublishingInstance" set:value="CM" />
    </settings>
  </sitecore>
</configuration>

Another common problem we’ve seen occurs when restoring a production database into dev/test/qa environments. What can happen is that the EQStamp properties get out of sync, the EventQueue grows too big, and so on. An easy fix for this is to just remove the EQStamp rows from the Properties tables.

Now to our odd problem, one that I at least hadn’t heard of before. We had all the servers correctly configured, and remote events would run for a few minutes and then just silently stop. There was nothing in the log files, even in debug mode. An application pool recycle would get the event queue going again for a while, and then it would stop again.

It turned out to be caused by some really bad code in Sitecore.Kernel. The Sitecore.Search.IndexLocker constructor contains this piece of code:

if (!this.lockInstance.Obtain(int.MaxValue))
    throw new Exception("Could not obtain lock " + this.lockInstance);

Guess what: that exception will never be thrown if the code can’t obtain a lock. The int.MaxValue is used as a timeout in milliseconds, so the call effectively waits almost 25 days…
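To put a number on that timeout, here is a tiny sketch that converts int.MaxValue milliseconds into days. The class and method names are made up for illustration and are not part of Sitecore. The result is roughly 24.86 days.

```csharp
using System;

public static class LockTimeoutDemo
{
    // Converts a timeout given in milliseconds into days.
    public static double TimeoutInDays(int timeoutMilliseconds)
    {
        return TimeSpan.FromMilliseconds(timeoutMilliseconds).TotalDays;
    }

    public static void Main()
    {
        // int.MaxValue is 2,147,483,647 ms; one day is 86,400,000 ms.
        double days = TimeoutInDays(int.MaxValue);
        Console.WriteLine(days);
    }
}
```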

The lock it was trying to obtain turned out to be a Lucene write lock (even though we’re using Solr), and there turned out to be an old lock file in the Data/indexes/__system folder. That’s a legacy index used for internal search, and it seems to be finally gone in the latest version.

So what does this have to do with remoting? Well, there is this Sitecore.Services.Heartbeat class that loops over all registered agents. It turns out this isn’t very robust: the entire heartbeat work loop stops when it gets stuck on one of the agents, as happened above. That effectively stopped them all, remote event processing included.
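To illustrate the failure mode, here is a simplified model of a sequential agent loop. This is not the actual Sitecore.Services.Heartbeat code; the agent names are hypothetical, and a ManualResetEventSlim that is never set stands in for the lock call that never returns.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class HeartbeatSketch
{
    // Runs every agent in order on one thread, the way a naive heartbeat loop might.
    public static void RunAgents(IEnumerable<Action> agents)
    {
        foreach (var agent in agents)
        {
            agent(); // if this call never returns, the agents after it never run
        }
    }

    // Runs three hypothetical agents and reports which ones completed
    // within the observation window.
    public static List<string> Simulate(TimeSpan observeFor)
    {
        var log = new List<string>();
        var neverReleased = new ManualResetEventSlim(false);

        var agents = new List<Action>
        {
            () => { lock (log) log.Add("cleanup agent"); },
            () => neverReleased.Wait(), // stands in for Obtain(int.MaxValue)
            () => { lock (log) log.Add("remote event agent"); }
        };

        // Background thread so the blocked loop does not keep the process alive.
        var worker = new Thread(() => RunAgents(agents)) { IsBackground = true };
        worker.Start();
        worker.Join(observeFor); // stop watching after the window has passed

        lock (log) return new List<string>(log);
    }

    public static void Main()
    {
        foreach (var entry in Simulate(TimeSpan.FromSeconds(1)))
            Console.WriteLine(entry);
    }
}
```

Only the first agent gets logged. The third one never gets its turn because the second one is still waiting, which matches what we saw: no errors, no log entries, just silence.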

So, if you’re facing this issue too, stop the website, remove any existing *.lock files in the Data/indexes subfolders, and start the website again.
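If you have many servers to clean up, the removal step can be scripted. Below is a minimal sketch of a console helper, assuming you pass the path to your Data/indexes folder as the first argument. The class and method names are my own, and it should only be run while the website is stopped.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LockFileCleanup
{
    // Deletes every *.lock file below the given folder and returns the deleted paths.
    public static List<string> RemoveLockFiles(string indexesRoot)
    {
        var removed = new List<string>();

        // GetFiles takes a snapshot first, so deleting while iterating is safe.
        foreach (var file in Directory.GetFiles(indexesRoot, "*.lock", SearchOption.AllDirectories))
        {
            File.Delete(file);
            removed.Add(file);
        }

        return removed;
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("Usage: LockFileCleanup <path to Data/indexes>");
            return;
        }

        foreach (var file in RemoveLockFiles(args[0]))
            Console.WriteLine("Removed " + file);
    }
}
```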