Overnight Outage 2013-02-22

Could you let us know what happened last night?

actually i get this: 502 Bad Gateway nginx/1.2.6 Help!

@mchurch: Welcome to PA. It's usually great here!!

As far as your 502...where are you getting it? I'm guessing you'll answer since I just tried it and got the same. I tried a few others as well. It would seem that even though the main site is back up, there are still issues. It's just after 9am at PA headquarters, so I'm sure they are on the case now.

Yup, we're looking into it. We'll need to find out why none of us were woken up too -- looks like it's been out since 05:43 UTC.

yes, correct, I should get some Json response, but nothing! So, as I can see, it's not just me!

..."We'll need to find out why none of us were woken up too ".. there was a party yesterday night at PythonAnyWhere!! :)

Sadly, there wasn't ;-)

Actually, it looks like Glenn was woken up and was logged in fixing stuff from 5:45 to 7:41 -- looks like the main site was sorted but then something (maybe the same issue) has now started affecting customer web apps. Investigations ongoing...

You mean PythonNowhere...☺

Hey, giles, we cross posted. My message was in response to mchurch.

I'm sure you'll figure it out!!

We definitely will. It looks like an out-of-control xfssyncd process on a file server was the cause of the problem at 5:45ish; Glenn fixed that by bouncing the affected server and some others that depend on it. Everything was fine by 7:45.

However, something else went wrong shortly after that (around 8:10) and started taking out users' web applications (while leaving the main site untouched). It may be independent, but the timing is suspicious.

We've got some web apps back and are working on the rest...

While things were down it gave me a chance to check out Resolver One...☺

It's pretty cool. I'm not sure why I didn't take a look at it sooner.

Nothing yet. I'm getting a timeout now. Does the problem only affect free accounts? Thanks.

No, it looks like it affected both free and paying users :-( Not everyone, but quite a large number.

We have a process running now that is recovering the system; so far it seems to be working -- all users with usernames starting with "a" should be fine.

Good for me...☺

All user web apps should now be fine. We're investigating what look like intermittent problems with consoles on the consoles-1 and consoles-2 servers -- if you are seeing anything odd, we'd love to hear about it!

Everything is fine here now, tnks. Mchurch

I faintly remember issues of xfssyncd going into a tight loop from a few years (maybe 5 or so?) ago where the filesystem was close to capacity, but I thought they'd fixed that long ago.

I think some significant work was planned for xfssyncd last year, but that was about the point I stopped following its development. If you continue to have problems (and don't feel like migrating to ext4 or other FS) then it might be worth checking the changelogs for more recent kernels just to see if there are relevant improvements.

We're actually in the process of preparing a new version of our file server, based on Ubuntu Quantal instead of Debian Squeeze. This is because we hit a different limitation of the XFS version we were using, and working around that required a kernel upgrade... and we'd been intending to switch to Ubuntu for a while anyway.

We'll push that one out over the next few days (we have a few wrinkles to get out first) so perhaps that will fix the xfssyncd problem. Probably creating a whole swathe of new and different problems in the process ;-)

Ah, it'll be really interesting to see how things fare. From what I can tell there's been a bit of a resurgence of XFS development over the last few years, so perhaps it's worth sticking with it. They seem to have grasped the importance of heavy parallelism a bit better than some of the other filesystem teams, so it's quite possible it'll perform better on systems with large numbers of cores.

As of 2.6.39 I believe that the delayed logging patch is now the default, which should improve the historically slow metadata problem (not that we ever ran into it at my last employer, since we were dealing with massive files).

@PA: You say new version of "our file server" and that leaves me wondering how many different images you are using? Also whether you plan to migrate all to Ubuntu or not?

In the event that my use of the term images is confusing, I mean things like:

  • File Server
  • Database Server
  • Webserver
  • Etc.

There are 7 server types right now:

  • MySQL - These are just stock Amazon RDS instances
  • Proxy - Squid, for the web proxying for free accounts
  • File - user file storage on XFS volumes, sharing it over NFS, disk quota management, and Dropbox processes
  • Task - scheduled task execution in user sandboxes
  • Console - the Tornado-based server that talks to the in-browser console, plus execution of stuff run from there in sandboxes.
  • Web - user web apps and also the main PythonAnywhere site.
  • Housekeeping - sending bills, monitoring the other servers for problems, Beanstalk servers for other servers to send async task requests to each other, and pretty much everything else.

PostgreSQL support will be added by another server type.

On the live site, all of these are running 64-bit Debian Squeeze apart from the MySQL server, which is a MySQL appliance of some kind managed by Amazon. In our integration system, as you know, we're testing a Quantal image for the file server only; if it works out, we'll probably move the other "non-execution" servers (that is, the ones that don't run our users' code) over to Quantal next. If that works, the console, task and web server images will follow.

While I'm at it, perhaps it's worth explaining how we create these images and deploy a new cluster.

Historically, we had a "baseline" image, which was a barebones Debian image with pretty much nothing installed but the OS and ssh. We then had a "packages-included" image, which was that plus all of the Debian packages -- everything from nginx to beanstalk -- various versions of Python, plus the Python packages in our batteries-included list. We generate the packages-included image by firing up a fresh machine based on the baseline and running a whacking great fabric script which installs everything using a combination of apt-get, source downloads and compiles (for Python and some of the other stuff) and pip/easy_install, then runs a "generate an image from this machine" script. This takes a couple of hours to run, so when we install a new package on the batteries-included list it's just a few minutes' work adding it to the script, then we run it in the background. When it's finished, we just put the new machine image ID into the codebase, check it in, then push the change to integration.
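To make the "add a package to the script" step above concrete, here is a minimal sketch of how such a script might turn package lists into install commands. It is not the actual PythonAnywhere fabric script: the function name `install_commands` and the example package lists are hypothetical, and the sketch only builds command strings rather than running them on a remote machine.

```python
import shlex


def install_commands(apt_packages, pip_packages):
    """Build the shell commands for one image build (hypothetical helper).

    apt packages go into a single apt-get call; each Python package gets
    its own pip call so that a failure is easy to attribute.
    """
    commands = []
    if apt_packages:
        quoted = " ".join(shlex.quote(p) for p in apt_packages)
        commands.append("apt-get install -y " + quoted)
    for package in pip_packages:
        commands.append("pip install " + shlex.quote(package))
    return commands


# Example lists are assumptions for illustration only.
for command in install_commands(["nginx", "beanstalkd"], ["numpy"]):
    print(command)
```

Under this layout, adding a package to the batteries-included list would be a one-line change to the pip list, which matches the "few minutes' work" described above.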

All of the different server types are created by firing up a packages-included machine, then pushing a few config files and, in the case of all but the proxy server, the core PythonAnywhere codebase -- this takes 10 minutes or so per machine, most of which is waiting for Amazon to fire up images and rebooting the configured server. We can start a complete cluster in "configured" mode in the background in an hour or so (depending on how many machines of each type we need -- new ones can be added to the cluster later, but we try to size it correctly at startup), and then we push it to live by running a script that migrates our own database tables, then switches the user storage volumes and the outward-facing IPs over from the old cluster to the new one. That latter script is what causes the downtime when we upgrade, and we're actively looking for ways to make it run faster -- or, ideally, for having some way to allow the old and new clusters to share the user file storage and DB while they're both running so that we just need to switch the IPs over and do a zero-downtime deploy.
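The split between background work and the cutover script can be sketched as data. This is only an illustration of the sequence described above: the `Step` class, the `in_cutover_script` flag and the step names are made up for the example and are not taken from the real deployment code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    in_cutover_script: bool  # True if the step runs in the final switch-over


# Order follows the description above; the names are illustrative.
DEPLOY_STEPS = [
    Step("start packages-included machines", False),
    Step("push config files and codebase", False),
    Step("reboot configured servers", False),
    Step("migrate database tables", True),
    Step("switch user storage volumes", True),
    Step("switch outward-facing IPs", True),
]


def cutover_steps(steps):
    """Return the names of the steps that run during the downtime window."""
    return [step.name for step in steps if step.in_cutover_script]


print(cutover_steps(DEPLOY_STEPS))
```

Seen this way, the zero-downtime goal amounts to shrinking the cutover list to the IP switch alone, which requires the old and new clusters to share file storage and the database.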

Now, to switch to Ubuntu, we've created a new baseline and forked the package install script -- so we now have Debian and Ubuntu baselines, and Debian and Ubuntu packages-included images. The changes in the script are pretty small (Ubuntu being very similar to Debian) but unfortunately are too large to stay in the same script without it becoming a rat's nest of ifs. So while as a transition it's not too hard, we have to do it all fairly quickly because we don't want to have to maintain two package-installer scripts for any length of time.

Finally, I should say that all of these scripts apart from the final "switch over from the old live cluster to the new one" are run from PythonAnywhere itself. This means that (for example) one of us might have a console open with a checkout of our codebase with a packages-included image update script running, while we're coding new features in another checkout in another console. If, say, Harry is running the script and wants to hand it over to Glenn, then it's just a case of sharing the console.

All sounds like a perfectly sensible approach. I'm surprised that doing the installs takes so long, though - unless you were referring to the time taken to create the image itself? That's the main problem with disk images, I suppose, but if the creation is fully automated then it's no big deal.

Have you looked into something like puppet to do the configuration? I've only briefly played with fabric in the past, but it certainly made life easy for remote execution. I felt all it really needed was a more elegant way to push files over the SSH connection (not that one can't put something together oneself, but I always felt what was missing was a convenient way to say "here's a local template directory structure of how the remote end should look - make that so with minimal transfers, rsync-styley").
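For what it's worth, the planning half of that "make the remote end look like this template" idea is small enough to sketch. The following is an assumption-laden illustration, not a fabric feature: `tree_manifest` and `plan_sync` are hypothetical names, the remote manifest is assumed to have been gathered separately (for example by hashing files on the server), and the actual transfer over SSH is left out.

```python
import hashlib
import os


def tree_manifest(root):
    """Map each file under root (relative, '/'-separated) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            with open(full, "rb") as handle:
                manifest[rel] = hashlib.sha256(handle.read()).hexdigest()
    return manifest


def plan_sync(local, remote):
    """Return (to_upload, to_delete) so the remote ends up matching local.

    Both arguments are path -> digest mappings. Files whose digests
    already match are left alone, which keeps transfers minimal.
    """
    to_upload = sorted(p for p, digest in local.items() if remote.get(p) != digest)
    to_delete = sorted(p for p in remote if p not in local)
    return to_upload, to_delete
```

Real rsync goes further and transfers only the changed blocks within a file; this sketch works at whole-file granularity.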

Also, I'm always a bit disappointed that the various service start/stop systems that have been developed to replace System V don't have a convenient way to say "I've changed a bunch of stuff - now start anything which should be running and isn't, stop anything which is running and shouldn't be (or was uninstalled) and reload anything where the configuration files have been modified". Or maybe upstart does have a cunning way of doing that, I don't know, but it would need the init scripts to be a little more clever than they are.
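The "start what's missing, stop what shouldn't be there, reload what changed" rule can be written down as plain set arithmetic. This is a sketch of the decision logic only, with a hypothetical `reconcile` function; it does not talk to upstart, System V init or any other real service manager, and the service names in the example are just illustrations.

```python
def reconcile(desired, running, changed_config):
    """Work out which services to start, stop and reload.

    desired:        services that should be running
    running:        services that are running right now
    changed_config: services whose configuration files were modified
    """
    desired = set(desired)
    running = set(running)
    changed_config = set(changed_config)
    return {
        # Should be running but isn't.
        "start": sorted(desired - running),
        # Running but shouldn't be (or was uninstalled).
        "stop": sorted(running - desired),
        # Staying up, but its configuration changed.
        "reload": sorted(desired & running & changed_config),
    }


actions = reconcile(
    desired={"nginx", "squid", "beanstalkd"},
    running={"nginx", "squid", "mysql"},
    changed_config={"nginx", "beanstalkd"},
)
print(actions)
```

A service that is about to be started is deliberately not reloaded as well, since it will read its new configuration on startup.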

FWIW: I had the same "502 Bad Gateway nginx/1.2.6" problem and it was resolved after clicking the "reload web app" button.

@giles: Thank you very much for the extended answers. It is truly appreciated. I really dig the PA transparency we experience here!!

@stefaanhimpe: Welcome to PA. I hope you have a great time here!!