Django Caching with Backups

Django’s cache is useful for speeding up access to data from slow sources, such as a remote RSS feed. But if the remote source becomes inaccessible, the data disappears once the cached copy expires. One solution is to cache the data for longer, but that can lead to stale data. Another is to keep two cached copies with different lifetimes, but that can quickly eat up a lot of memory.

One answer is to keep a copy in the cache, and a second copy on disk (or in the database) as a backup. The datastore module works a lot like the cache module, except that there is no explicit set() operation. Instead, get() is called with three arguments: the key, a function that will generate a new object for the cache, and the object’s time to live in seconds.

import datastore
data = datastore.get('foo', get_more_foo, 5*60)

What happens here? First, the cache is checked for an object keyed ‘foo’. If one is found, it is returned. If not, get_more_foo is called to create a new object, which is stored in the cache for 5*60 seconds (five minutes). The new object is then pickled and backed up to the file system.

If get_more_foo raises an exception, the result of the last successful call to get_more_foo (the backup copy) is returned instead. The backup value is also put back into Django’s cache to keep access quick until the next attempt at updating ‘foo’, but for only half the regular time ((5*60)/2 seconds).

The only issue is when there is nothing in the cache and there is no backup. This can happen on the first call to datastore.get. In this case, the exception raised by get_more_foo is propagated to the caller.
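The flow described above can be sketched roughly as follows. This is not the datastore module's actual source, only a minimal illustration under stated assumptions: the cache is passed in as an object with Django-style get(key) and set(key, value, timeout) methods, the backup is a pickle file named after the key, and DictCache is a hypothetical in-memory stand-in so the example runs without a Django project.

```python
import os
import pickle


class DictCache:
    """Hypothetical stand-in for Django's cache; it records timeouts but never expires."""

    def __init__(self):
        self.values = {}
        self.timeouts = {}

    def get(self, key, default=None):
        return self.values.get(key, default)

    def set(self, key, value, timeout=None):
        self.values[key] = value
        self.timeouts[key] = timeout


def get(cache, backup_dir, key, generate, ttl):
    """Return the value for key, falling back to an on-disk backup.

    Assumption: the key is safe to use as a file name.
    """
    value = cache.get(key)
    if value is not None:
        return value  # cache hit

    backup_path = os.path.join(backup_dir, key + ".pickle")
    try:
        value = generate()
    except Exception:
        if not os.path.exists(backup_path):
            raise  # no backup yet: let the caller see the error
        with open(backup_path, "rb") as f:
            value = pickle.load(f)
        # Serve the backup, but retry the source after half the usual time.
        cache.set(key, value, ttl // 2)
        return value

    # Fresh data: cache it for the full time and refresh the backup.
    cache.set(key, value, ttl)
    with open(backup_path, "wb") as f:
        pickle.dump(value, f)
    return value
```

Note that a sketch like this cannot tell a cached None from a cache miss. On the very first call there is also no backup to fall back on, so the exception raised by the generator function escapes to the caller.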

Therefore, it is useful to wrap the call to datastore.get in a try/except block:

import datastore
 
try:
    data = datastore.get('foo', get_more_foo, 5*60)
except FooNotAvailable:
    ...
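Neither get_more_foo nor FooNotAvailable is defined above, so here is one hedged guess at what they might look like for a remote feed. The URL, the optional fetch argument (there only so the function can be exercised without a network), and the choice to translate OSError are all assumptions made for illustration. The sketch is written for Python 3.

```python
import urllib.request

FEED_URL = "http://example.com/feed.rss"  # hypothetical address


class FooNotAvailable(Exception):
    """Raised when the remote source cannot be read."""


def get_more_foo(fetch=None):
    """Fetch fresh data, translating I/O failures into FooNotAvailable."""
    if fetch is None:
        # Default: download the raw feed bytes with a ten-second timeout.
        fetch = lambda: urllib.request.urlopen(FEED_URL, timeout=10).read()
    try:
        return fetch()
    except OSError as exc:
        # urllib.error.URLError and socket timeouts are OSError subclasses.
        raise FooNotAvailable(str(exc)) from exc
```

Raising one specific exception type keeps the except clause around datastore.get narrow, so unrelated bugs in the generator function are not silently swallowed.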

The datastore is controlled by three required settings (in settings.py). DATASTORE_DIR is the directory in which to place the backup files. /tmp is a good place for them. DATASTORE_CULL_AFTER is the number of calls to datastore.get between cleanups, in which DATASTORE_DIR is cleared of old backups that have not been accessed recently. DATASTORE_CULL_TIME is the number of seconds since the last access after which a backup becomes eligible for deletion (checked against the file's atime).

The less reliable the data source, the larger DATASTORE_CULL_TIME should be. Each successful call to get_more_foo will regenerate the backup, updating the atime. If the source is likely to be unavailable for hours at a time, a DATASTORE_CULL_TIME in the neighborhood of twelve hours may be in order.
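To make the settings concrete, here is a sketch with example values and one plausible way the cleanup pass could work. The values are illustrative only, and cull() is a hypothetical helper, not the module's actual code. The ".pickle" suffix is likewise an assumption, added so that the sketch leaves unrelated files in a shared directory such as /tmp alone.

```python
import os
import time

# Example values only; tune them to the reliability of the data source.
DATASTORE_DIR = "/tmp"
DATASTORE_CULL_AFTER = 100          # clean up once every 100 calls to get
DATASTORE_CULL_TIME = 12 * 60 * 60  # twelve hours, in seconds


def cull(directory, max_age, suffix=".pickle", now=None):
    """Delete backup files not accessed within the last max_age seconds.

    Returns the names of the files that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not (name.endswith(suffix) and os.path.isfile(path)):
            continue  # not one of our backups
        if now - os.stat(path).st_atime > max_age:
            os.remove(path)
            removed.append(name)
    return removed
```

A call such as cull(DATASTORE_DIR, DATASTORE_CULL_TIME) would then run once every DATASTORE_CULL_AFTER calls to datastore.get.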

Note: the datastore module uses the with statement, so Python 2.5 or later is required.

download datastore

Nov 12th, 2008 | Posted in Programming