We tend to spend quite some time on URL management to make URLs nice and SEO friendly in all sorts of ways, but I’ve noticed that most CMSs I’ve been in contact with have overlooked what I think is one of the most important areas: proper handling of permanent redirects (301) when URLs change.
The URL structure should of course reflect the site structure, contain relevant keywords and so on. When content authors work with the site, those URLs may change, and the old ones then return a “404 Not Found”. That is bad from an SEO perspective, but it also means that visitors currently on the site may have downloaded HTML pages that now contain broken links. My experience is that the problem escalates when there are integrations with other systems that generate items.
I created a solution to handle this in a proprietary CMS in 2008. Back then it was a huge load of quite complex code. I’ve now ported it into Sitecore and it was surprisingly easy to implement. The result is less than 1/10 of the original code. Sweet! (But to be fair, that old code actually had a few additional features as well.)
The idea is to keep track of all URLs on the site and their corresponding Item IDs in a database. Then we hook into the request pipeline and, just before giving up with a “404 Not Found”, we look for the requested URL in the database. If it is found, we resolve the Item ID and send a permanent redirect to the item’s new URL.
The solution basically contains three parts:
- URL publisher
- Request processor
- Cleanup agent
1. URL Publisher
The URL publisher hooks into the Sitecore publishing pipeline. At the “publish:end” event, it scans through all the pages on the site and writes each page’s URL and Item ID into a database. A page in Sitecore can be accessed using many different URLs: different casing, with or without the .aspx extension, with or without an embedded language and so on. Therefore we store each URL in its simplest form, that is, a lower case version without the .aspx extension and without the language. (You may want to change this if you block items without language versions.)
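To make the normalisation concrete, here is a minimal sketch in Python. It is not the module’s actual code (the module itself runs inside Sitecore); the function name `normalize_url` and the `LANGUAGES` set are assumptions made up for this illustration.

```python
from urllib.parse import urlsplit

# Hypothetical set of language codes used by the site (lower case).
LANGUAGES = {"en", "sv", "da"}


def normalize_url(url, languages=LANGUAGES):
    """Reduce a URL to its simplest form:
    lower case, no .aspx extension, no embedded language."""
    # Keep only the path; the query string and host are ignored.
    path = urlsplit(url).path.lower()
    if path.endswith(".aspx"):
        path = path[: -len(".aspx")]
    # Drop empty segments (leading, trailing or doubled slashes).
    segments = [s for s in path.split("/") if s]
    # Remove a leading language segment such as /en/ or /sv/.
    if segments and segments[0] in languages:
        segments = segments[1:]
    return "/" + "/".join(segments)
```

With this sketch, `/en/Products/Widget.aspx` and `/products/widget` both end up as the same key, so either form of an old URL can be found later.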
2. Request Processor
The Request processor hooks into the Sitecore request pipeline after the ItemResolver. If the ItemResolver didn’t find an Item, we look for the URL in the database instead. If the URL is found, we try to load the corresponding Item. If that Item exists and has a Layout etc, we send a 301 Permanent redirect to its new URL. We also update a “LastSeen” field in the database so that we can do some cleanup later. I’ve added a hit counter as well but it’s currently not in use.
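The decision flow above can be sketched roughly like this in Python. Everything here is hypothetical and only meant to show the logic: `UrlRecord`, `resolve_dead_url` and the two dictionaries stand in for the database table and for Sitecore’s item lookup, and the path is assumed to be already normalised.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class UrlRecord:
    item_id: str         # ID of the item the URL pointed to
    last_seen: datetime  # last time this old URL was requested
    hits: int = 0        # hit counter (stored, not otherwise used)


def resolve_dead_url(path, url_table, current_urls, now=None):
    """Called only when the normal item lookup has failed.

    url_table:    normalised URL -> UrlRecord (stand-in for the database)
    current_urls: item ID -> current URL (stand-in for loading the item)
    Returns (status_code, location).
    """
    now = now or datetime.now(timezone.utc)
    record = url_table.get(path)
    if record is None:
        return 404, None          # URL was never known: a real 404
    new_url = current_urls.get(record.item_id)
    if new_url is None:
        return 404, None          # item is gone or cannot be rendered
    record.last_seen = now        # remember that the old URL is still used
    record.hits += 1
    return 301, new_url           # permanent redirect to the new URL
```

Note that the record is only touched when a redirect is actually sent, so URLs pointing at deleted items are not kept alive by repeated requests.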
3. Cleanup Agent
The Cleanup agent runs on a regular basis using the standard scheduler. Its job is to remove URLs from the database that are no longer needed. Otherwise the database could grow needlessly. There are a number of reasons why an old URL could still be in use. Google may (hopefully) have indexed it. With a 301 redirect, Google should update its index shortly. Other sites may link to your page and your users may have bookmarked it. Those may not be updated for some time. So, by updating the “LastSeen” field on each hit, we can see if an old URL is still in use. When a URL hasn’t been used for a long time, let’s say three months, we can probably delete it.
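As an illustration of the cleanup rule, here is a small sketch using Python’s built-in sqlite3 module. The table name `DeadUrls`, the `ItemId` column and the choice to store “LastSeen” as a Unix timestamp are assumptions for this example, not the module’s real schema. It also assumes that URLs that are still current get their timestamp refreshed when the site is published, so only genuinely stale rows expire.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def create_table(conn):
    # Hypothetical schema; LastSeen is stored as a Unix timestamp (seconds).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS DeadUrls ("
        " Website TEXT NOT NULL,"
        " Url TEXT NOT NULL,"
        " ItemId TEXT NOT NULL,"
        " LastSeen INTEGER NOT NULL,"
        " PRIMARY KEY (Website, Url))"
    )


def cleanup(conn, now, max_age=timedelta(days=90)):
    """Delete URLs that have not been seen for max_age (about three months).
    Returns the number of rows removed."""
    cutoff = int((now - max_age).timestamp())
    cur = conn.execute("DELETE FROM DeadUrls WHERE LastSeen < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

A scheduled job would then simply call `cleanup(conn, datetime.now(timezone.utc))` at whatever interval suits the site.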
I hope the module is quite self-explanatory, but there are a few things worth noting. The module needs one database table to store its data. If the autoInstall property is left as “true”, the module will try to ensure that the database table exists upon publish. You can also create the table manually, and you can tune the length of the “Website” and “Url” columns as needed.
Apart from that, just review the example config file in App_Config/Include and rename it.
In complex setups, you may consider using an alternative database for storing the URLs. Just register a separate connectionString and point the provider to that one instead.
Well, there’s not that much to say about testing. Install the module, publish the site and view it in a second browser where you are not logged into Sitecore. By design, the module will not execute in preview mode.
Change the URL of some pages (rename the items, change DisplayName or whatever is needed according to your Sitecore setup) and publish again.
Now you should be able to reload pages using the old URLs and they’ll redirect to their new URLs.
Please note also that there’s a little bug in Sitecore where an incremental publish won’t publish items that you have only renamed. You need to do some kind of additional change (like clicking Save, Ctrl-S) to make sure the __Updated field gets refreshed.
The current version stores all URLs in the web database and all URLs are processed after each publish. This is probably a performance problem on large sites. We could trigger the processing per published item instead, but then we’d have to publish the URLs of all its children as well. That would also put some additional requirements on the Cleanup agent, so that URLs in use don’t expire incorrectly.
The current URL publisher also touches a lot of records in the database, making the database log grow. We could use a simpler key-value store instead, or use a separate database with bulk transaction log. I’ve tried to follow the existing Sitecore patterns, so it should be simple to just create additional providers for this.
You can now download the module source code or a binary package from http://marketplace.sitecore.net/Modules/DeadUrls.aspx.
I hope you like it! Feel free to send me a tweet @mikaelnet if you do.