Introduction to Web Caching
Posted on October 30, 2016
What is caching
A cache (pronounced like "cash") is a system that allows your application to store data for faster access in the future.
Say you want to know what you earn per hour. You can find out what you make monthly and how many hours you work in a month. Finding out that information might take a short moment.
Once you have this info, you can make a calculation. Dividing the income by the working hours gives you the answer.
You have now successfully run a task. But you want to make sure you don't have to do this calculation again, so you write the answer down on a small piece of paper. Next time you want to know your hourly rate, you can just refer to the piece of paper. You can skip the tasks of gathering the information and calculating the rate.
'Saving' the answer on a piece of paper is essentially what a computer does when it is caching data. Computers generally cache things like data lookups and the results of calculations. When a computer receives the same question again, it looks through its 'scratch book' and returns the answer in a much shorter time than if it had to do the same tasks over and over again.
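The piece-of-paper idea can be sketched in a few lines of Python. This is only an illustration: the income and hours are made-up figures, and the names `notes` and `hourly_rate` are hypothetical.

```python
# Hypothetical figures, used only for illustration.
MONTHLY_INCOME = 3200.0
HOURS_PER_MONTH = 160.0

# The 'piece of paper': answers already worked out, keyed by question.
notes = {}
calculations_done = 0


def hourly_rate():
    """Return the hourly rate, calculating it only the first time."""
    global calculations_done
    if "hourly_rate" not in notes:
        # Not written down yet: do the work and note the answer.
        calculations_done += 1
        notes["hourly_rate"] = MONTHLY_INCOME / HOURS_PER_MONTH
    return notes["hourly_rate"]


first = hourly_rate()   # calculated
second = hourly_rate()  # read back from the notes
```

After both calls, `calculations_done` is still 1, because the second call never reaches the division.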
The most familiar example of caching on a personal computer is probably sleep mode. When going into sleep mode, instead of shutting down all its work, your computer will save everything that is going on at that exact moment. It will 'recall' what it was doing, and resume those tasks when you wake up the computer. This is often much faster than booting it the standard way.
So how does cache work?
The main difference between a cache database and a regular database is that a cache is optimised for very simple lookups, which makes it much faster than a conventional database.
Many databases can also do calculations, maintain relations and handle more complicated tasks. A cache database does not need any of this. It only saves computational 'questions' along with their 'answers'.
When a computer gets a task like 'display the weather information', it's not going to check the weather every single second. Instead, it asks the cache whether this question has been asked already. The cache searches through its simple database for the question 'display the weather information'. If it finds it, the computer can immediately display the weather. If not, the computer has to ask the internet for this information, and save it to the cache for later.
But there is a problem. What if the cached answer is two days old? You wouldn't get relevant weather information. The computer has to make sure the cache gets refreshed once in a while. That's where I introduce a third value that gets saved with each cache item: the TTL, or Time To Live. This value tells the cache the exact time at which the cached item becomes irrelevant, or out of date.
When the cache finds the question, but the current time exceeds the question's Time To Live value, the cache is refreshed. The computer is instructed to request new weather information from the internet, save it to the cache database, and finally display it to the user.
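The lookup, expiry check and refresh described above could look roughly like this in Python. It is a minimal sketch, not a production cache: the class name `TTLCache`, the method `get_or_refresh` and the `fetch` callback are all hypothetical names chosen for this example.

```python
import time


class TTLCache:
    """A tiny question -> answer store where every answer has an expiry time."""

    def __init__(self, clock=time.monotonic):
        self._items = {}     # question -> (answer, expires_at)
        self._clock = clock  # injectable so the behaviour is easy to test

    def get_or_refresh(self, question, fetch, ttl_seconds):
        """Return a fresh cached answer, or call fetch() and cache its result."""
        now = self._clock()
        entry = self._items.get(question)
        if entry is not None:
            answer, expires_at = entry
            if now < expires_at:
                # Found and still within its Time To Live.
                return answer
        # Missing or out of date: do the slow work and store the new answer.
        answer = fetch()
        self._items[question] = (answer, now + ttl_seconds)
        return answer
```

A call such as `cache.get_or_refresh("display the weather information", fetch_weather, 600)` would then only run the (hypothetical) `fetch_weather` function when there is no answer yet, or when the stored one is more than ten minutes old.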
When to use caching
Caching is at its best when it is implemented right where a lot of costly calculations are made. We put a so-called caching 'layer' in between the part where a user requests some data and the part where the computer runs a task on its processor. That way the cache gets a chance to return the data first, with a much smaller performance impact on the system.
Caching is commonly used in the following three computational operations:
- Database querying
When you have a big database, you might want to add a caching layer in between your requests and the actual calls to the database. The cache could store commonly requested data from the database, including relationships between multiple databases. Any similar request would then not be sent to the database, but handled by the cache instead.
- Large calculations
Simple calculations like currency conversions generally do not need any caching. On a complicated project, though, there can be very complex calculations. A few years ago I created an application that could calculate the average distance between different locations of the same shop, using geographical coordinates. Since the number of shops would not change that often, adding a cache with a one-day Time To Live seemed like a good idea here.
- Requests of data from external sources
I already gave an example of this: the weather report. Some websites like openweathermap.org allow you to grab weather information and display it on your own website. If your server had to request this data on every page view, your website would become very slow. Instead we can cache this data, and save it for a few minutes.
The trade-off is simple: you can get much better performance, but sometimes your data falls a bit behind. Setting a reasonable TTL is key in caching.
Caching on the web
There are many caching techniques and systems available, and I only use a handful of them. I will describe the ones I commonly use, and will go into more depth about installing and using them in future articles.
Varnish is the one and only system I use that is completely dedicated to caching.
The easiest way to explain Varnish is that it will go to your website, take a screenshot of it, and display that screenshot to the next user that wants to view the same page. Instead of having to generate the whole page, it already knows exactly what it looks like.
The term screenshot falls a little short here, since the next user doesn't actually get an image. The 'cached' page can handle the same interactions as a non-cached page. Clicking on links, buttons and so on will still work as usual.
Varnish sits right in between the point where the page requests come in and the application that would normally serve the request. Varnish will 'catch' requests that it has seen before and saved to its cache. It can serve pages very, very quickly.
You have to implement it correctly though. If you had something like a webshop, you wouldn't want the next user to end up with the same items in their shopping cart as the previous visitor. Something like that should not be cached at all, since the contents of a shopping cart change regularly and generally differ per user.
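The idea of serving saved pages while letting personal pages through can be sketched in plain Python. This is not Varnish configuration (Varnish has its own configuration language); it only illustrates the decision being made. The path prefixes, the `serve` function and the `render` callback are assumptions made up for this example.

```python
page_cache = {}  # path -> rendered HTML

# Hypothetical list of paths that are personal to each visitor.
UNCACHEABLE_PREFIXES = ("/cart", "/checkout")


def serve(path, render):
    """Return the page for `path`, calling render(path) only when needed."""
    if path.startswith(UNCACHEABLE_PREFIXES):
        # Per-user content: always pass the request on to the application.
        return render(path)
    if path not in page_cache:
        # First visitor for this page: generate it once and keep the result.
        page_cache[path] = render(path)
    return page_cache[path]
```

With this sketch, two visitors asking for `/about` would trigger only one render, while every request for `/cart` would still reach the application.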
Redis is described on its website as an in-memory data structure store. Let's unpack what that means.
In memory - means it is using the memory of a computer, which is a lot faster than a hard disk. Every computer uses memory in some way to be able to function for more than one task at a time.
Data structure store - means a database.
Redis is a way to keep a (relatively small) database in the memory of a computer. By only keeping important information here, while leaving the less frequently requested information on the hard disk, an application can become very fast.
On pages where Varnish cannot be used, I use Redis to improve the performance of commonly used functionality in my application. You can cache a lot in Redis, from lists of most read articles to complex forms and parts of administration pages.
Along with keeping the cache in memory, Redis can also save it to the hard disk. This does use some space, but it can be very handy in case of a server restart. Memory is emptied on a restart, while the hard disk is not. After a server reboots, Redis can fetch the latest cache from the hard disk and copy it back into memory for fast access.
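As a rough idea of what caching something like a 'most read articles' list in Redis might look like from Python, here is a small helper. It assumes a client object that offers `get(name)` and `setex(name, time, value)`, as the `Redis` client from the redis-py package does. The helper name `cached_json`, the key and the `compute` callback are hypothetical.

```python
import json


def cached_json(client, key, ttl_seconds, compute):
    """Return the value stored under `key`, computing and storing it on a miss.

    `client` is assumed to offer get(name) and setex(name, time, value),
    as the redis-py `Redis` client does. Values are stored as JSON text.
    """
    raw = client.get(key)
    if raw is not None:
        # Cache hit: decode the stored JSON and skip the expensive work.
        return json.loads(raw)
    # Cache miss: do the work, then store the result with a Time To Live.
    value = compute()
    client.setex(key, ttl_seconds, json.dumps(value))
    return value
```

Assuming redis-py is installed and a Redis server is running locally, a call could look like `cached_json(redis.Redis(), "most_read", 300, load_most_read)`, where `load_most_read` stands for whatever slow function builds the list.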
Personally, I have not used Memcached a lot, but I know it is very similar to Redis. As far as I know, Memcached is a bit easier to set up and manage. Memcached does not save anything to the hard disk; it only uses memory.
In my opinion, Redis is generally the better option. Redis is slightly faster and more scalable. It does require a bit more technical work though.
So now you know what caching is and where it can be implemented. Your visitors hate waiting, and many will leave your website if it is too slow. You can make your visitors happy and reduce server costs at the same time, so go and set up some caching!