A common problem with large-scale web applications

When working on a large-scale web application, typically hosted on multiple content delivery servers, whether due to high load or a requirement for redundancy, there are quite a few pitfalls to handle.

One of them is content generated on demand by the content delivery web servers. My most common case is ordinary web images. Images are typically selected by content authors, who don’t know (and shouldn’t need to know) anything about pixels, bandwidth and so on; that is the web developers’ responsibility. The allowed image size constraints should be specified in the aspx/ascx/cshtml files. This means you use some sort of function to resize the image when it is requested, and you cache the result, typically as a file on the content delivery web server.
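For illustration, here is a minimal sketch of such a handler, assuming a classic ASP.NET IHttpHandler and System.Drawing. The handler name, the query-string parameters and the ~/App_Data/ImageCache folder are made up for this example, and input validation and cache headers are left out for brevity:

```csharp
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Web;

// Hypothetical handler sketch: resizes the source image to the width given in
// the query string and caches the result as a file on the local server.
public class ResizeImageHandler : IHttpHandler
{
    public bool IsReusable => true;

    public void ProcessRequest(HttpContext context)
    {
        // Note: query-string validation is omitted here for brevity.
        string sourcePath = context.Server.MapPath(context.Request.QueryString["src"]);
        int width = int.Parse(context.Request.QueryString["w"]);

        // Cache file name derived from the source name and the requested width.
        string cachePath = Path.Combine(
            context.Server.MapPath("~/App_Data/ImageCache"),
            Path.GetFileNameWithoutExtension(sourcePath) + "_" + width + ".jpg");

        // Generate the scaled image only if it isn't cached on this server yet.
        if (!File.Exists(cachePath))
        {
            using (var source = new Bitmap(sourcePath))
            {
                int height = source.Height * width / source.Width;
                using (var scaled = new Bitmap(source, new Size(width, height)))
                {
                    Directory.CreateDirectory(Path.GetDirectoryName(cachePath));
                    scaled.Save(cachePath, ImageFormat.Jpeg);
                }
            }
        }

        context.Response.ContentType = "image/jpeg";
        context.Response.WriteFile(cachePath);
    }
}
```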

Since you’re a good internet citizen, you include a version token for generated images in the image URL and use long-lived client caching. And since you need speed, you use a Content Delivery Network (CDN) as well. Here I assume the site itself is the CDN origin, i.e. the images aren’t uploaded to CDN storage.
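As a sketch of what the versioning part could look like (the helper name and query-string parameters are hypothetical, matching the handler sketch above), the version token might simply be derived from the source file’s timestamp, so any edit to the source produces a new URL:

```csharp
using System;
using System.IO;
using System.Web;

// Hypothetical helper: builds a versioned image URL so that a change to the
// source file changes the URL, which allows far-future cache headers downstream.
public static class ImageUrlHelper
{
    public static string VersionedImageUrl(string virtualPath, int width)
    {
        string physicalPath = HttpContext.Current.Server.MapPath(virtualPath);

        // Version token derived from the source file's last write time (UTC).
        // A changed source means a new URL, and thus a browser/CDN cache miss.
        string version = File.GetLastWriteTimeUtc(physicalPath).Ticks.ToString("x");

        return VirtualPathUtility.ToAbsolute(virtualPath)
               + "?w=" + width + "&v=" + version;
    }
}
```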

Now, with multiple content delivery servers behind a load balancer, this usually causes problems. Let’s say you update something that changes the image and its URL (the versioning part). During the publishing sequence of your CMS, there will be a window where some content delivery servers are updated and some are not. The window may be short, but it will be there (*).

Since we’re talking large scale here, you now have a problem. A visitor to your newly published site gets an updated HTML page with the new image URL. The browser then loads the new image, typically through the CDN, and that request eventually reaches the content delivery cluster. Where will it end up? Possibly (or probably, if your load balancer considers backend latency) on a server that doesn’t have the new image yet; we’re still in the middle of a publishing process, remember. So, depending on your implementation, you’ll serve an old image, generate a new image based on old data, or return a 404. All are bad.

To make things worse, that bad response will probably get cached by your CDN, so now all visitors get an incorrect image.

This is a bit tricky, and in my experience the problem is overlooked by many. I haven’t found a silver bullet yet, but the way I solve it is to customize the cache headers a bit. When the source of a generated image is new, say less than an hour old, the handler generating the scaled image should respond with a short cache lifetime, an hour or so. That way we still benefit from some caching, keeping current site visitors happy, and once the source is older than an hour we can extend the cache lifetime headers to a year or so.
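Here is a sketch of that logic, assuming classic ASP.NET’s HttpCachePolicy and a one-hour threshold; the class and method names are mine, not from any particular library:

```csharp
using System;
using System.IO;
using System.Web;

// Sketch of the age-based cache header logic: newly changed sources get a
// short cache lifetime, older (stable) sources get a long one.
public static class GeneratedContentCaching
{
    private static readonly TimeSpan NewThreshold = TimeSpan.FromHours(1);
    private static readonly TimeSpan ShortLifetime = TimeSpan.FromHours(1);
    private static readonly TimeSpan LongLifetime = TimeSpan.FromDays(365);

    public static void ApplyCacheHeaders(HttpResponse response, string sourcePath)
    {
        TimeSpan sourceAge = DateTime.UtcNow - File.GetLastWriteTimeUtc(sourcePath);

        // While the source is "new", the publish process may still be running on
        // other servers, so keep the cache lifetime short. Once things have
        // settled, hand out a far-future lifetime to browsers and the CDN.
        TimeSpan lifetime = sourceAge < NewThreshold ? ShortLifetime : LongLifetime;

        response.Cache.SetCacheability(HttpCacheability.Public);
        response.Cache.SetMaxAge(lifetime);
        response.Cache.SetExpires(DateTime.UtcNow.Add(lifetime));
    }
}
```

The handler that serves the generated image would call something like ApplyCacheHeaders just before writing the file to the response.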

The length of this shorter cache period depends on how long your publishing process takes (worst case), your CDN requirements, and how quickly content must be guaranteed to be updated after a publish.

I think generated images are a quite common scenario, but the problem is not limited to images. You’ll face the same issue with common libraries such as Combres. The version number part of a Combres URL is calculated from the source files’ timestamps. Since you can’t write all files to every server in one atomic operation (unless they all run from one share – and you probably don’t in a large-scale setup), the same kind of problem arises.
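To illustrate why this breaks (this is not Combres’ actual implementation, just a simplified stand-in): a version token computed from file timestamps will differ between two servers whose files were written at slightly different moments, so during the publish window they hand out different URLs:

```csharp
using System;
using System.IO;
using System.Linq;

// Illustration only: a version token derived from source file timestamps.
// Servers whose files were written at different times produce different
// tokens, and thus different URLs, until the publish has completed everywhere.
public static class BundleVersion
{
    public static string FromTimestamps(params string[] filePaths)
    {
        long combined = filePaths
            .Select(File.GetLastWriteTimeUtc)
            .Aggregate(0L, (acc, timestamp) => acc ^ timestamp.Ticks);

        return combined.ToString("x");
    }
}
```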

(*) You can avoid this problem if you can take 50% of the servers out of the load balancer and complete the publish process on those servers. Then, in an atomic operation, bring them back online and take the other half offline, complete the publish process on those, and bring them online as well. This is complicated, and when can you accept running at 50% capacity in a large-scale installation? With two servers, sure, but as the cluster grows… I guess you can’t.