Improved Sitecore Links Database provider

I have a somewhat love-hate relationship with the Sitecore Links Database. It’s really useful in some scenarios, especially integrations, batch operations, and similar tasks where a Solr index might not be accurate enough. However, the Links Database has also caused me many performance headaches over the years due to some issues with its implementation.

The main problem with the default Links Database provider is that it performs multiple small synchronous SQL INSERT and DELETE operations in the database. As the Sitecore databases grow, so does the Links table. With many insert/delete operations, the table indexes become very fragmented, which slows down the system. Application performance may also be quite heavily affected if latency to SQL Server increases.

Writing a new provider has been on my radar for a few years. I finally made some time to improve the implementation a bit and put it on GitHub.

Note: You’d typically never need to use the Links Database from a site rendering perspective. Use Solr/ContentSearch instead on the public-facing website. The Links Database is mainly useful from an authoring perspective.

Issues to address

The built-in SqlServer Link Database provider has a couple of issues that I wanted to address:

  • The indexes of the Links table rapidly become very fragmented.
  • When indexes become too fragmented, SQL Server won’t use them. It then does a full table scan for each query, which is very costly since this table can easily contain millions of records and may be queried frequently.
  • This table is updated on basically every item save, where Sitecore reviews all the links on the item and updates them accordingly.
  • Updates are done in one transaction for each saved item. Some calls to the provider compare what’s already in the database and then make one synchronous call for every INSERT and DELETE needed. Other API calls delete all existing link rows for the item and then add all expected links again. This really causes index fragmentation.
  • Since each SQL command is sent as a stand-alone synchronous operation, even a little latency to SQL Server can have a huge impact.
  • As table indexes become fragmented, SQL Server may choose to ignore them. When this happens, performance suffers severely.
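To see how bad the fragmentation actually is in your environment, you can query SQL Server’s index statistics. This is a standard diagnostic query; the table and schema names assume a default Sitecore installation where the Links table lives in the Core database.

```sql
-- Check fragmentation of the Links table indexes (run against the Core database).
SELECT i.name AS IndexName,
       s.avg_fragmentation_in_percent,
       s.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Links'), NULL, NULL, 'LIMITED') s
JOIN sys.indexes i
  ON i.object_id = s.object_id AND i.index_id = s.index_id
ORDER BY s.avg_fragmentation_in_percent DESC;
```

As a rough rule of thumb, fragmentation above 30% is where a rebuild (rather than a reorganize) is usually recommended.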

SqlServerBatchLinkDatabase

From Sitecore 9.2 (IIRC), Sitecore ships with a new Links provider. It performs updates in batches, basically by appending multiple SQL statements into a long string. According to Sitecore, “This improves performance considerably.” However, it has some major drawbacks:

  • When rebuilding the links table, it truncates the entire table and rebuilds it from scratch. This essentially means the links database is in an inconsistent state until the rebuild completes in full, which can take several hours. In practice, rebuilds have to be done in a “single-user” mode, as the links database is incorrect until the rebuild is done. If a user or a background job performs a task relying on the Links database during the rebuild, things can go very wrong.
  • Even though rebuilds are faster, it still performs a lot of inserts and deletes, causing the indexes to rapidly become fragmented.

The updated implementation

My updated implementation is a hybrid of the default SQL implementation and the Batch implementation. The base class of the Links provider is the abstract class LinkDatabase. The SqlLinkDatabase implementation is a generic database implementation, as the name implies, and the SqlServerLinkDatabase implementation overrides a few methods with SQL Server specific operations.

  • LinkDatabase
    • SqlLinkDatabase
      • SqlServerLinkDatabase
        • SqlServerBatchLinkDatabase

As there are multiple public and protected methods calling each other on different layers, I decided to make a clean cut and inherit only the abstract LinkDatabase base class. Some code is therefore basically just migrated over from the other implementations.

Database operations use the same DataApi, but the {0} and {1} bracket placeholders have been removed, since this provider will only support SQL Server anyway. IIRC, the placeholders are there to support other database flavors, such as Oracle. Having plain SQL statements makes the code more readable.
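For illustration, here is the difference between the generic placeholder style and the plain style. The column list is abbreviated and the exact statements in the provider may differ; this is just to show the readability gain.

```sql
-- Sitecore's generic form, where {0}/{1} expand to database-specific
-- identifier quoting (e.g. [ and ] on SQL Server):
--   INSERT INTO {0}Links{1} ({0}SourceDatabase{1}, {0}SourceItemID{1}, ...) VALUES (...)

-- The plain SQL Server form used in this implementation:
INSERT INTO [Links] ([SourceDatabase], [SourceItemID], [TargetDatabase], [TargetItemID])
VALUES (@sourceDatabase, @sourceItemId, @targetDatabase, @targetItemId);
```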

Operations are sent in batches to SQL Server by concatenating multiple SQL statements. This is very similar to Sitecore’s batch implementation. The main difference is that I use batches for more operations, and I always wrap them in one transaction per item, so that the Links database is never in an invalid state. Personally I don’t like the idea of concatenating SQL statements, but it was the best I could do in a short timeframe.

Saving items

When an item is saved, its links need updating. The provider queries for existing links and compares them with the new links, then performs one batch update containing only the needed INSERT and DELETE operations. Any unchanged rows are left as-is. This reduces the number of synchronous calls to the database and makes fewer changes to the database on some of the API calls.
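A batch produced this way might look roughly like the sketch below: only the stale rows are deleted and only the new links inserted, all inside one transaction per item. Column names assume the default Links table schema, and the parameter names are placeholders for illustration.

```sql
-- Hypothetical batch for one saved item after diffing existing vs. new links.
BEGIN TRANSACTION;

-- Remove only the link rows that no longer exist on the item.
DELETE FROM [Links]
WHERE [SourceDatabase] = @db AND [SourceItemID] = @itemId
  AND [ID] IN (@staleLinkId1, @staleLinkId2);

-- Add only the links that are new since the last save.
INSERT INTO [Links] ([SourceDatabase], [SourceItemID], [SourceFieldID],
                     [TargetDatabase], [TargetItemID], [TargetPath])
VALUES (@db, @itemId, @fieldId, @targetDb, @targetItemId, @targetPath);

COMMIT TRANSACTION;
```

Unchanged rows are never touched, which is what keeps the index churn down compared to the delete-all-then-reinsert approach.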

Sometimes there are multiple identical links in Sitecore, for example when the same component has multiple occurrences in a renderings field. By default, Sitecore stores each of these links individually, resulting in multiple identical link rows. I don’t really see the need for those duplicates; the only difference between them is the internal database ID, which is never exposed in the LinkItem object. In my implementation, those duplicates are now filtered out.
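If you’re curious how many such duplicates exist in your own Links table, a grouping query can surface them. This is a sketch against the assumed default schema; depending on your Sitecore version, the real notion of “identical” may also involve language and version columns.

```sql
-- Find groups of link rows that are identical apart from the internal ID.
SELECT [SourceDatabase], [SourceItemID], [SourceFieldID],
       [TargetDatabase], [TargetItemID], COUNT(*) AS Duplicates
FROM [Links]
GROUP BY [SourceDatabase], [SourceItemID], [SourceFieldID],
         [TargetDatabase], [TargetItemID]
HAVING COUNT(*) > 1;
```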

Querying items

The methods for getting item references are basically the same as Sitecore’s implementation. Sitecore has started using NOLOCK on some queries, which essentially means SQL Server won’t take a read lock before fetching the rows. I don’t know if this is well tested or if some of them just happened to be there, because they are not on every statement. I’m using the NOLOCK hint as well on some of the queries where I think it’s safe. This probably needs more testing and some code review.

I’ve also added index HINTs to the queries as a painkiller for fragmented indexes. This basically forces SQL Server to use the indexes even if/when they become too fragmented. This may also need more testing, because it’s a bit tricky to fragment indexes on purpose in an easy way. I’ve tested using “wrong” index hints, and that certainly gave a huge performance hit, as expected, so it should work just fine.
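Combined, the two hints end up in the table hint clause of the query. The sketch below shows a referrer-style lookup; the index name ndxTargetItemID is what I’d expect in a default Sitecore schema, but verify it against your own database before forcing it.

```sql
-- Fetch the links pointing at a given item, skipping read locks (NOLOCK)
-- and forcing the target-item index even when it is fragmented.
SELECT [SourceDatabase], [SourceItemID], [SourceFieldID]
FROM [Links] WITH (NOLOCK, INDEX(ndxTargetItemID))
WHERE [TargetDatabase] = @db AND [TargetItemID] = @itemId;
```

Note that forcing a nonexistent index makes the query fail outright, and forcing the wrong one makes it slow, so hints like these need to be kept in sync with the actual schema.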

Potential Index updates

Since all Sitecore databases share the same Links table, all queries are performed on the Database + ID columns (where ID is SourceItemID or TargetItemID). By default, the ID columns have one index each, but only as single-column indexes. This is probably good, but I’ve considered changing those to be based on Database + ID, which would better match the queries being executed. On the other hand, the indexes would become larger, and the number of distinct values in the database column isn’t that great (core, master, web, and potentially a few more).

I’ve therefore also made a SQL script for this optional index update. I need to do more testing in various scenarios to see whether this index change actually makes any improvement. Maybe it doesn’t. I guess it varies a bit from solution to solution as well. You can give it a try and see if it works better; otherwise, just revert to the default indexes.
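The gist of such a script is to swap the single-column indexes for composite ones that lead with the database column. This is a sketch of the idea, not the exact script from the repo; the default index names (ndxSourceItemID, ndxTargetItemID) are my assumption and should be checked against your schema first.

```sql
-- Optional: replace the single-column indexes with composite
-- (Database, ID) indexes to match the query shape.
DROP INDEX [ndxSourceItemID] ON [Links];
CREATE NONCLUSTERED INDEX [ndxSourceDatabaseItemID]
  ON [Links] ([SourceDatabase], [SourceItemID]);

DROP INDEX [ndxTargetItemID] ON [Links];
CREATE NONCLUSTERED INDEX [ndxTargetDatabaseItemID]
  ON [Links] ([TargetDatabase], [TargetItemID]);
```

Remember that any index hints in the queries would need updating to the new index names if you apply this.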

Future improvements

I’ve been thinking of reducing the size of the Links table as well, by adding the ability to ignore certain links. Say you have thousands of items sharing the same workflow: do you really need a reference in the Links table from every item to the workflow it’s using? Such a change would of course have to be made with great care.
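To gauge how much such a filter would save, you could count the link rows originating from a given field. The sketch below uses a placeholder parameter for the workflow field ID; look up the actual field ID in your solution, as nothing here is a built-in constant.

```sql
-- Rough sketch: how many link rows exist only because of the workflow field?
-- @workflowFieldId is a hypothetical placeholder for the field's ID.
SELECT COUNT(*) AS WorkflowLinkRows
FROM [Links]
WHERE [SourceFieldID] = @workflowFieldId;
```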

Another improvement could be skipping SQL Server altogether. I haven’t done any tests, but I guess other storage providers could potentially be faster for this kind of operation.

Code

You can grab the code from GitHub and give it a try.
