Welcome Back 2013-02-28

I see the deploy wasn't completely smooth; I hope all the kinks are worked out now. The site and services appear to be back.

I look forward to hearing what made it in. When you're done putting out fires, please take a moment to give us a rundown...☺

Thanks, a2j :-)

Harry's put together a blog post about what went in, but the real biggie was the switch of our file server image to Ubuntu.

Here's what we thought would go wrong, and what actually did... probably in more detail than you want or need, but perhaps you'll find it interesting.

The risk we were concerned about was that it might take a long time to mount the old user storage volumes. When we go live with a new cluster, all of the volumes that store everyone's data are detached from the old cluster, attached to the new one, then mounted.

In our tests, we'd discovered that if we created a fresh storage volume from one of our backup snapshots of a real live user storage volume, then attached it to a file server based on the new image, it could take almost three hours to mount the first time. Since all of our final tests of whether the new system could mount the old system's volumes used volumes created from such snapshots, this was worrying: it could have meant a three-hour outage.

But then we discovered that when we tried to mount such a snapshot on a server based on the old Debian image, it also took a very long time. So we thought that perhaps the slowness was an artifact of the "created from a snapshot" nature of the volumes we were mounting, and not something to do with the new fileserver image. In other words, it was the way we were testing, not the thing we were testing.
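That "created from a snapshot" slowness matches how snapshot-backed block volumes generally behave: blocks are pulled in lazily from the snapshot the first time they're read. A common mitigation (our illustration here; the post doesn't say this was actually used) is to pre-warm a freshly created volume by reading it end to end before relying on it. A minimal sketch:

```python
def prewarm(device_path, chunk_size=1024 * 1024):
    """Read every block of a device (or file) once, forcing a
    snapshot-backed volume to hydrate all of its blocks.

    Returns the total number of bytes read."""
    total = 0
    with open(device_path, "rb") as dev:
        while True:
            chunk = dev.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total

# Hypothetical usage -- the device name is an assumption:
# prewarm("/dev/xvdf")
```

The trade-off is that the pre-warm itself takes as long as reading the whole volume, so it only helps if you can do it before the cutover rather than during it.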

Unfortunately the only way to find out if this was the case was to do a deploy to the live environment, which is why we did it at 4am -- the three hours from 4am-7am UTC are our slowest time of day in terms of users, so it would cause the least harm if there was a 3-hour outage.

When we did the deploy, it turned out that mounting the old user storage onto the new cluster was almost instant. We were OK!

What we'd missed out on was something quite unrelated. We deploy new code from a checkout on a workstation in the office, as upgrading PythonAnywhere is the one thing we can't do from PythonAnywhere [images of someone performing brain surgery on himself]. This local checkout was on an Arch Linux VM I have running on my Windows 7 workstation; that's pretty much our standard procedure. But what I'd missed was that this time I'd checked out the codebase into a directory that was actually on the Windows filesystem, mounted into the Arch VM using VirtualBox's directory-sharing feature. This meant that the checkout did not have the right file permissions -- everything in such a volume has 600 perms. And those incorrect permissions were on the files that were copied up to the machines in the new cluster.
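In hindsight, this is the kind of thing a quick pre-deploy sanity check could catch. The sketch below is our own illustration (not PythonAnywhere's actual tooling): it walks a checkout and flags files whose permission bits are exactly the uniform 0o600 that a VirtualBox shared-folder mount can impose:

```python
import os
import stat


def find_suspect_perms(root, suspect_mode=0o600):
    """Walk a source checkout and return paths whose permission bits
    are exactly `suspect_mode` -- the uniform mode a VirtualBox
    shared-folder mount can stamp onto every file."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = stat.S_IMODE(os.lstat(path).st_mode)
            if mode == suspect_mode:
                suspects.append(path)
    return suspects
```

If every file in the checkout comes back as a suspect, the deploy script could refuse to continue -- a whole tree of identical 600s is a strong hint the checkout is sitting on a shared-folder mount.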

This meant that when we brought up the new cluster, a bunch of things that relied on our uploaded code having the right file permissions were broken. Most of this was easy enough to fix; we have relatively few dependencies on file permissions. But some of it wasn't; in particular, the console servers (which double as ssh servers) have some code that's executed before anyone logs in over ssh, and the file for that code had the wrong perms so it couldn't execute. Which meant that we couldn't ssh into the console servers at all. Which, of course, made it impossible to finish off the configuration and make them live.
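This is also the sort of thing a pre-flight check could assert before traffic is switched over. A minimal sketch, assuming a hypothetical list of files that must be executable (the real hook paths aren't given in the post):

```python
import os


def check_executables(paths):
    """Return the subset of `paths` that is missing the executable bit
    (or missing entirely), so a deploy can abort before going live."""
    problems = []
    for path in paths:
        if not os.access(path, os.X_OK):
            problems.append(path)
    return problems


# Hypothetical usage -- the hook path is an assumption:
# bad = check_executables(["/opt/cluster/ssh_prelogin_hook.sh"])
# if bad:
#     raise SystemExit("not executable: %s" % ", ".join(bad))
```

Running a check like this against the freshly uploaded code, before declaring the cluster ready, would turn "we can't ssh in at all" into a clear error message during the deploy.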

We discovered this problem while the site was down, so we had to recreate the console servers during the outage, which took a while. Once the new servers were up, we had to carefully slot them into the cluster; this isn't something we normally do while a cluster isn't live, but we were able to adapt the code we use to add a new console server to a running live cluster (when we're scaling up for heavier load), and that, after some scary moments, did the job.

Anyway, all should be fine now. Harry and I are handing over to the other guys, and it's time to go home :-)

Oh, and I should also say -- if you look at the main forums page you'll see that we've added the name of the last person who posted on each topic, as you requested :-)

Thank you for the detailed update. As always, the transparency at PA is one of the things I value most! I know it should be the service, but I'm odd...☺

I'm excited about what this deploy means to the goals you/we have moving forward.

And, yes...I already noticed the last person who posted having made it into the live cluster. Hopefully that will be helpful to many users and staff.

I hope you sleep well...then get back to work & write us some more code...☺

Good stuff. That permissions issue sounds like a pain - I must confess, I try not to let any of our source leak out onto the Windows machine I now have (at the last place I worked we didn't even have access to Windows). I slave the git repo onto the Windows machine as required, and then if I need to make local changes I merge them back into the repo on my main Linux desktop as patches. This can cause some line-ending pain, but it's pretty easy to deal with.

I had to restart all of my web apps just now; it appears all of them were unresponsive. Glad everything is back up now though :)