Forums

How to memcache?

I'm using Flask to fetch data via FTP and feed it to Google Spreadsheets (which don't understand FTP), which then generate interactive charts ready to blog. The data is updated once per day, so fetching it every 4 hours should be enough. The Flask documentation mentions a memcached server: http://flask.pocoo.org/docs/patterns/caching/

How can I use it to keep Google Spreadsheets from forcing the script to download the data every 5 minutes, which might get my app onto the data provider's blacklist?

Why not set a flag that tells you when you last pushed new data to Google Spreadsheets so you don't check unless the data has been updated? That way you'll only need to hit them once per day after the update has been successful.
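
For example, a minimal sketch of that flag idea (the filename is just a placeholder) could look like this:

import os
import time

# Hypothetical flag file - its modification time records the last successful push.
FLAG_FILE = os.path.expanduser("~/last_push_timestamp")

def pushed_recently(max_age=24 * 3600):
    """Return True if we pushed to Google Spreadsheets within the last day."""
    try:
        return time.time() - os.path.getmtime(FLAG_FILE) < max_age
    except OSError:
        return False  # flag file doesn't exist yet, so we've never pushed

def mark_pushed():
    # Touching the file updates its mtime, which is all the flag needs.
    with open(FLAG_FILE, "w"):
        pass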

The memcached server they refer to is a distributed in-memory caching layer used to alleviate database load on large-scale websites and the like. These setups are often backed by traditional relational databases, which can become a significant bottleneck when the site scales to a large number of users. The obvious solution is to reduce the number of accesses to the database by temporarily storing (or "caching") results in local memory on each webserver, and memcached extends this idea by caching data across many machines in the network to make better use of spare memory. Another solution is to move away from classic relational databases to one of the so-called "NoSQL" solutions (typically Cassandra or HBase).

Still, that's all just for background interest - for what you're doing, you absolutely do not need any of the fancy things I've mentioned above.

If I understand your situation correctly, you have a data source somewhere on the web which updates on a roughly daily basis and makes its information available via FTP. You're attempting to write an online service which receives requests via HTTP from Google spreadsheets and responds with the most recently-obtained data. These HTTP requests could come in very frequently, and you're trying to avoid making frequent FTP requests because the data provider doesn't allow this. Am I right?

To do this you'll need to find a way of storing the data locally in your PA account. This should be easy, as even free accounts have 500MB quota (unless the data is really large, but I'm assuming not). Since the data is FTPed, it must already be in a file-oriented form, so the on-disk format shouldn't need much work. I think you have two basic approaches:

  1. As a2j mentions above, you could FTP the data on-demand from the source and store it on disk. You need to store the time at which you fetched it - you could use the modified time on the file, but due to the slightly unusual storage back-end that PA uses you might prefer to instead generate a second file with a timestamp (or whatever you fancy). If further requests come in less than 12-24 hours after this time, you simply serve the data stored on disk. For later requests, you replace the saved data, update the timestamp, and serve the fresh copy instead.

  2. You could set up a regular daily scheduled job which downloads the data from the data source and stores it on disk; requests from Google Spreadsheets are then always served from that file.

Personally I suspect the latter option is the simplest to implement, and has the advantage of avoiding a high-latency request in the case that the data source is unresponsive - frankly, I trust FTP about as far as I can throw a cathedral. It's personal choice, however.
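
If you go for the scheduled job, a rough sketch (untested - the host, path and filenames are placeholders for your actual data source, and I'm assuming anonymous FTP) might look like this:

import ftplib
import os

FTP_HOST = "ftp.example.com"
FTP_PATH = "pub/data/daily.csv"
LOCAL_FILE = os.path.expanduser("~/daily.csv")

def fetch_daily_data():
    ftp = ftplib.FTP(FTP_HOST)
    ftp.login()  # anonymous login - pass a username/password if your source needs them
    # Download to a temporary file and rename it into place, so web requests
    # never see a half-written file.
    tmp_file = LOCAL_FILE + ".tmp"
    with open(tmp_file, "wb") as fd:
        ftp.retrbinary("RETR " + FTP_PATH, fd.write)
    ftp.quit()
    os.rename(tmp_file, LOCAL_FILE)

if __name__ == "__main__":
    fetch_daily_data()

You'd point a daily scheduled task at a script like that, and the web app itself only ever reads the local file.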

Thanks Cartroo,

I think #2 is indeed the least complex solution, and it also scales easily to more files. However, now the burden is on the file system. Is there an option to put the data into a variable, or is the app reloaded on each request? I really want to be sure the app stays within its quota and keeps serving, whatever happens on Google's side.

Edit: Found out how to build a 3-layer cache (memory, file, external). Looks pretty fail-safe now.

Glad you've found a solution! Do let us know if you need anything else - we can always temporarily bump up your quota if you suddenly run out of space...

Out of interest, what are the charts of? Are the blogs something we might find interesting?

If you want to make sure that the storage is persistent, you really need it to be on disk somewhere. The problem is that the webserver may be spread over several threads, several processes or even several physical machines in principle. This means that you need to put the data somewhere where all of these will have access.

Sounds like you're using in-memory caching, and that's fine - just remember that your web application may be split over several different processes (potentially on different machines!) so you can't assume they'll all have access to the same structures. If you're really doing caching then that's fine - if the process doesn't happen to have accessed the underlying storage yet then it'll just promote the data to its local cache. However, if you're relying on having that data in memory, you may have problems.
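
Just to illustrate what I mean by promoting data to a local cache, here's a rough sketch (untested, and the filename is only a placeholder) of a per-process memory cache that falls back to the shared file on disk:

import os
import time

CACHE_FILE = os.path.expanduser("~/my_cache_file")
_cached_data = None
_cached_at = 0

def get_data(max_age=3600 * 4):
    global _cached_data, _cached_at
    # Serve from this process's own memory if it's fresh enough...
    if _cached_data is not None and time.time() - _cached_at < max_age:
        return _cached_data
    # ...otherwise fall back to the shared file and promote it into memory.
    with open(CACHE_FILE, "rb") as fd:
        _cached_data = fd.read()
    _cached_at = time.time()
    return _cached_data

Each process keeps its own copy in _cached_data, but they all read from the same file, so it doesn't matter which process handles a given request.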

Anyway, sounds like you got things going, which is great.

That's excellent advice. Thanks, Cartroo

@Cartroo, if I understand you correctly my app may have several instances sharing the same file system, but not memory. So, let's say there are 100 instances, each will make a daily FTP request and may again blacklist my app... I see two options: defining an upper limit on the number of instances (there's no need to have that many), or somehow flagging which instance downloads the file first within the 4-hour window and letting all the other instances use that file instead of hammering the FTP server. Do you agree?

@harry, thanks for your interest. arctic.io started feeding this scientific FTP data into Google Spreadsheets to show daily sea ice development in the Arctic. There is still an occasional issue when sheets import data from other sheets, leading to empty charts, but I'm close to solving it with a new escalation scheme; Google doesn't handle this reliably.

@Cartroo, if I understand you correctly my app may have several instances sharing the same file system, but not memory.

Yes - all the instances running on PA share the same underlying filesystem, even if they're running on different servers. So, anything you store in a file will be available to all instances.

It's possible that some of your instances may be running as threads in the same process, and this means they will share memory. So it's possible to write your application to use memory and it will appear to work. However, at a later time it's possible for some instances to run in a different process, or on a completely different server - that's just the way things are when you're running in the cloud. So, you can't assume that you can share memory between instances even if it sometimes seems to work.

I see two options: defining an upper limit on the number of instances (there's no need to have that many), or somehow flagging which instance downloads the file first within the 4-hour window and letting all the other instances use that file instead of hammering the FTP server. Do you agree?

Limiting the number of instances might help somewhat, but doesn't solve the underlying problem. Let's say you limit the instances to 1, so you can use in-memory storage. What will probably happen is that your instance will go idle for some time, and maybe the system will terminate that process and start it up somewhere else. At that point your memory storage has disappeared.

The second option you mention, to store the data on the filesystem, is definitely the best option in this case, I think. It also shouldn't be too difficult - Python makes it quite easy to do filesystem access. You could easily do something like this:

import fcntl
import os
import time
import urllib2

# Context manager that opens a file and holds an fcntl lock on it for the
# duration of the "with" block: exclusive for writes, shared for reads.
class FileWithLock(object):

    def __init__(self, filename, mode):
        self.filename = filename
        self.mode = mode
        self.fd = None

    def __enter__(self):
        self.fd = open(self.filename, self.mode)
        lock_mode = fcntl.LOCK_EX if self.mode[0] == "w" else fcntl.LOCK_SH
        fcntl.lockf(self.fd, lock_mode)
        return self.fd

    def __exit__(self, t, val, tb):
        fcntl.lockf(self.fd, fcntl.LOCK_UN)
        self.fd.close()
        self.fd = None
        return False

def get_data():
    data_file = os.path.expanduser("~/my_cache_file")
    data = None
    try:
        info = os.stat(data_file)
    except OSError:
        info = None
    # Refetch if there's no cached copy yet, or if it's more than 4 hours old;
    # otherwise serve the copy already on disk.
    if info is None or time.time() - info.st_mtime > 3600 * 4:
        with FileWithLock(data_file, "w") as fd:
            data = urllib2.urlopen("http://www.example.com/data").read()
            fd.write(data)
    else:
        with FileWithLock(data_file, "r") as fd:
            data = fd.read()
    return data

Now I'm not trying to say this code is brilliant (hard-coded URLs and filenames, no proper exception handling, etc.), but it's just an example to get you going. Note the use of file locking to make sure that concurrent requests can't corrupt the cached file. Some might say this is a little paranoid, but I don't like to assume that the Python write() call maps to an atomic underlying operation, especially on virtualised filesystems like on PA. If something's worth doing, it's worth doing properly!

All that said, you might find it easier to look into httplib2, which is already installed on PA; or a combination of requests, already on PA, and requests-cache, which you'd need to install yourself. Either of these solutions will do transparent caching for you so you don't need to worry about the details.
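
For example, with requests-cache something along these lines should be all you need (just a sketch - double-check the docs of whichever version you install; expire_after is in seconds in the versions I've looked at):

import requests
import requests_cache

# All requests made via the requests library are now cached transparently in an
# sqlite file, and refetched once the cached copy is more than 4 hours old.
requests_cache.install_cache("demo_cache", backend="sqlite", expire_after=3600 * 4)

data = requests.get("http://www.example.com/data").content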

As an aside, since PA already has requests, perhaps it might be useful to install requests-cache as well?

@Cartroo Wow, thanks! You've removed some hurdles (file access) from my path. I'll definitely switch to requests and read the requests-cache docs. I still have to check how to control the cache, since in some cases I need to override caching. The first option (limiting the number of instances) has been dropped; the code should work for any number of instances, including zero.

I'm happy to help. Using third-party caching code is likely the simplest solution, but there are potential gotchas to look out for. In particular, some caching solutions make a conditional request every time you fetch, just to check that the cached copy is still current.

From a very brief experiment with requests-cache, however, it seems that it's much simpler than that - it just caches the data for the time you specify, and once that's expired it makes an unconditional request for the data again. To be honest I'm a little disappointed that it's so simple, but it should work fine for what you want.
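
If you do need to force a fresh fetch now and then, recent versions of requests-cache also seem to provide a disabled() context manager (check the docs for whichever version you end up with) - roughly:

import requests
import requests_cache

requests_cache.install_cache("demo_cache", expire_after=3600 * 4)

# Requests made inside this block bypass the cache entirely.
with requests_cache.disabled():
    fresh = requests.get("http://www.example.com/data")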

I made my way through the docs and found out that requests even supports OAuth. That means I can do all my HTTP stuff with one lib, great. requests-cache lacks individual expiry times per URL, though maybe that's possible via different cache names. But that's not important for me, not now. Anyway, I'm not going to write import urllib any longer. The project is also young, with only a 4-month history on GitHub.

One warning here: requests currently has a bug in its proxy-handling code, which means that it can't access https sites when going through a proxy (which is the case for free accounts, but not for paid-for ones). So if you're accessing stuff over http then you're fine, but https won't work with it. We're tracking the bug and will upgrade our version of requests as soon as the bugfix is released -- it looks like it's being actively worked on, anyway.

BTW I've added a ticket for getting requests-cache installed; it looks like it'll be easy to get in next time we do a batch of new packages. In the meantime, pip install --user requests-cache should work fine.