Speed and performance of web apps: when can we expect changes?

Hi there! First, I just want to say I absolutely love PythonAnywhere. I love that I can quickly start a new project and try it out on a live server. I love the fact that if I absolutely needed to, I could log in from anywhere and make changes.

I do have one hesitation so far. I am planning on routing traffic from my old website to my new PythonAnywhere project. My friends and colleagues, who I have been using as 'guinea pigs' to test out my new web app, have been telling me the project is running extremely slowly. Last night I was showing someone the app (as basic as it is), and I would click on a link and it would take upwards of 10 seconds to load a new basic page. It was rather embarrassing going through a basic presentation, clicking on a link, having nothing happen, and then saying 'It will load, the server is just a little slow; once I start advertising I will move it to a faster server.' That last part I don't want to do. I am perfectly happy at PythonAnywhere, and migrating would just take more time that I do not have.

I have read other posts from people saying this is an issue, with a response along the lines of 'Things are in the works to make performance faster', but there are no additional details. What is being done, and how soon can we expect to see changes? I really cannot launch my services to my clients and industry with such slow performance. I am already a paid member; if I upgraded to the next level, would there be a significant difference? I love PA, I just wish it was a little faster. Maybe this is going to be addressed soon... (how soon?) Thanks!

It's a good question. I love what I can do, but I am reluctant to deploy even a simple project to PA, as it seems slow and might go down when upgrades are being done.

Is it viable to run a database app for a small (very small) business through PA?


@iconfly: If you look at the feature matrix you can see where the lines are drawn for improved performance by account type.

As you can see:

  • All paid plans currently get the same CPU allowance.
  • Bandwidth for the Premium account type is Low (the same as free accounts).
  • Bandwidth for Hosting and Pro accounts is Medium.

You can also see what type of account you have on that page (provided you are signed in) if you don't already recognize them by name.

@PA Staff: This does bring up a good point. It may be beneficial at times to be able to identify users here on the forum beyond paying/non-paying. I suppose it could create a privacy concern, but I know I personally wouldn't mind. Heck, you may get me to upgrade so I feel more important...☺

Hi, yes it appears I have the highest available plan.

Hi all, sorry for being so late to reply to this. We have been hitting some performance limits recently, and we have (as you know) been working on fixes. The biggest one is a switch from Apache to nginx as our front-end web server. The initial work is done; it's now running through our integration system over the weekend, and (assuming nothing's completely broken) we should be able to push it live soon. We'll keep you updated.

Here's why we believe it will help. Under Apache, when you click the "Reload web app" button, your Python web process is normally killed and a new one started. But sometimes the old Python process keeps running. So far we've not been able to determine precisely what causes that, but if we don't keep a very sharp eye on the processes running on the server, we can wind up with a large number of defunct processes sucking up most of the CPU on a web server instance, and everyone's apps get slowed down. It looks like this happened a number of times over the weekend, and we'll be trying to find out why over the coming days.

But nginx doesn't seem to have the same problem: restarting the web app seems to kill the processes properly. So hopefully it will help. If not, we'll have to write something to keep an automatic eye out for defunct Python processes and kill them.
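If it came to that, a watchdog along these lines might do the trick. To be clear, this is a rough sketch and not PythonAnywhere's actual tooling; the `marker` string and the one-hour age threshold are made-up assumptions:

```python
import os
import signal
import time

STALE_AGE = 60 * 60  # assumed threshold: any worker older than an hour is suspect


def worker_pids(marker):
    """Find PIDs of Python processes whose command line contains `marker`."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/cmdline" % entry, "rb") as f:
                cmdline = f.read().replace(b"\0", b" ")
        except (IOError, OSError):
            continue  # process exited while we were scanning
        if b"python" in cmdline and marker.encode() in cmdline:
            pids.append(int(entry))
    return pids


def process_age(pid):
    """Seconds since the process's /proc entry was created."""
    return time.time() - os.stat("/proc/%d" % pid).st_ctime


def reap_stale_workers(marker="wsgi"):
    """Kill workers that have outlived the reload that should have ended them."""
    for pid in worker_pids(marker):
        if process_age(pid) > STALE_AGE:
            os.kill(pid, signal.SIGKILL)
```

Run from cron every few minutes, something like this would at least stop defunct workers from piling up between manual checks.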

Not to mention that nginx is simply faster and more efficient at serving high request rates than Apache, so the overall overhead should be lower.

Apache may be the swiss army knife of webservers, but sometimes all you really need is a very sharp katana.

adios apache... :D

Absolutely, nginx so far seems to be much easier to configure, and much more lightweight.

The weekend's integration run was promising -- quite a few errors, but a lot of them seem to be similar. So hopefully we'll get them all sorted quickly...

a2j welcomes nginx to PA with open arms!!

It looks like the nginx upgrade has been and gone - it might just be me, but the website seems a little snappier.

(I've been away for a few weeks tending to the small matters of our first child being born and starting a new job, so I've only just had a chance to come back and see how things are!).

Congratulations Cartroo! Amazing news.

We are seeing an overall decrease in latency. The fastest requests are no faster, but the average is now clustered much more tightly around a lower value. There are still some bugs to be ironed out, of course. It also means we can start releasing actual improvements as well.

Again though: Congratulations!

Hi PA team,

I switched to the Pro account in the hope that my web app would run much faster. At the beginning of my testing I was pretty pleased, but now I am not seeing any difference between the free and the paid account. Are you still working on server problems?

Currently, reloading the web app, displaying the error log, and executing REST APIs all take a very long time.

Could you give some feedback on the current PA deployment?


Hi roman,

Could you give us some example queries we can run so we can identify where the problem lies? The file view at the moment is unacceptably slow; it's doing a lot of work with every query. The web app reload takes time, but that's because it fully loads the Python app into memory. You don't have to watch it finish, but once it does stop spinning, your application should already have completed a request.

Neither of these should have anything to do with your own application(s), though. So I would be interested in getting some sample queries we can run through to figure out what is going on. If the information is private, then drop us an email at support or through the feedback dialog.


Thanks for your response. What is the email address I should use? I will prepare some test data so that you can exercise the API. Thanks in advance for helping! --Roman


I noticed that in some cases I am getting a 504 Gateway Time-out error.

Furthermore, the reload time of the app is not consistent: sometimes it is very fast (10 seconds), and in some cases it's over a minute. I saw this unreliable behavior using the free account too, but I expected that with the Pro account the response times for these basic app management functions would be more consistent.

When I get a reliable connection speed, using PA is enjoyable; otherwise I would not recommend switching to a paid service.

Let me know how I can help you track down possible latency and instability issues. I will send more information tomorrow; it's already late here in the US...

You can measure the reload time of your application a little more reliably by instantiating a class at the top level of your source file which logs the time to a file in both __init__() and __del__().

Here's a quick and dirty little example I just tried:

import os
import time

class ReloadTracker(object):

    def __init__(self):
        self.filename = os.path.expanduser("~/mysite/startstop.log")
        with open(self.filename, "a") as fd:
            fd.write("app starting: %.2f\n" % (time.time(),))

    def __del__(self):
        with open(self.filename, "a") as fd:
            fd.write("app finishing: %.2f\n" % (time.time(),))

tracker = ReloadTracker()
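One caveat: CPython doesn't guarantee that `__del__()` runs for objects still alive at interpreter shutdown, so the "finishing" timestamps may occasionally be missing. If that bites, an `atexit` hook is a slightly more dependable variant; the log path below is just an assumption, adjust to taste:

```python
import atexit
import os
import time

LOGFILE = os.path.expanduser("~/startstop.log")  # assumed location

def log_event(event):
    """Append a timestamped event marker to the log file."""
    with open(LOGFILE, "a") as fd:
        fd.write("%s: %.2f\n" % (event, time.time()))

log_event("app starting")
atexit.register(log_event, "app finishing")  # runs at interpreter shutdown
```

Dropping this at the top of the WSGI file should give start/stop pairs you can diff to measure reload time.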

Hi all -- sorry again for the problems. We're busily trying to work out what the issues are -- it seems that the plus side of Apache was that it was well-configured for some of the things we're doing, and while nginx definitely has the features we need, fine-tuning it so that performance is acceptable is proving harder than we'd hoped, even after all of our pre-deployment testing.

@romanbaumgaertner -- any information you have would be much appreciated; you can email it to support. The 504 errors you're getting probably happen when our server is processing your requests so slowly that the nginx process sitting in front of it times out. It's interesting to hear about the wide variation, though -- that's useful debugging information, for which many thanks. Sadly, although the resource limits for non-free accounts are higher, the problem right now is that many apps are (for reasons almost certainly related to our own configuration) hitting even those limits for fairly trivial processes. That's what we're working to fix right now.

@Cartroo -- thanks!

Some good news, hopefully! We've found a problem, and fixed it. It will take just under an hour to take effect, but that means that after 15:30 or so UTC today, everything should be back to normal. There may still be slowdowns until a full fix is in place, but these will be limited to between 06:00 UTC and 07:30 UTC daily. We'll address that ASAP, of course.

For those who are interested, here's what was happening.

The service was being intermittently slow. Until today we were unable to find a boundary point -- that is, a point at which it switched from being OK to being slow, or vice versa. At around 13:20 UTC today we observed one: in the hour or so before then -- at least as late as 13:15 UTC -- we'd seen that the server was unacceptably slow. Shortly before 13:20 UTC, it recovered.

When the system was slow, we could see that all of our web serving processes were maxed out waiting for IO to complete. When the system recovered, they were not. We'd known that something had been causing an IO bottleneck, but until we'd spotted the timing of the problem, it was very hard to work out what was doing that.

Given the timing, it was suddenly easy to spot the culprit. One of our file backup processes runs every two hours. Investigation showed that it had started at 12:00 UTC today and completed at 13:17 UTC. It appeared that the backup process was the problem: it was hogging the IO bandwidth, and that was having a knock-on effect on nginx.

We waited until the process kicked off again at 14:00 UTC to confirm this hypothesis. Our prediction was that if we were right about the cause, the server would be fine at 13:59 UTC, and then at 14:01 UTC all PythonAnywhere-hosted web applications would suddenly slow down and we'd see IO waits rocket up on all of our servers.

That's exactly what we saw.

So, for now, we've rescheduled the backup process. Until we've sorted out the underlying issues, the backup will run at 06:00 UTC, which is our quietest part of the day. This means that web applications will still have slow response times from 06:00 UTC to 07:20 UTC, which is a problem, but hopefully it will be more acceptable to everyone as a band-aid solution.

In the meantime, we'll address the two underlying problems that we have: firstly, the fact that our nginx setup, unlike the Apache one we had previously, is taken out by IO issues like this; secondly, the fact that our backup process eats up so much IO capacity and takes so long. Hopefully we'll have a fix for both of those soon, and we hope that our band-aid fix gives everyone the performance they need until then!

Thanks to everyone for their patience, and our apologies once more for the issues over the last few days.

All the best,


@giles - Thanks for the detailed information, it's always interesting to hear about these sorts of problems in case one runs into them oneself one day!

I'm not sure what your backup solution entails, but when I've hit problems in the past about spikes in usage (not so much resource limits as pushing up the 95th percentile in that case), I've always had good experiences looking for standard options for bandwidth limiting, to spread the spike over a longer period. For example, rsync has the --bwlimit option and scp has the -l option. If you're doing it remotely via something like HTTP, I'm sure many of those clients will also have similar options.
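For example, hypothetically (the paths and host names below are made up; note the unit mismatch between the two tools -- rsync's `--bwlimit` takes KB/s, while scp's `-l` takes Kbit/s):

```shell
# Cap this rsync at roughly 5 MB/s so it can't saturate the link
rsync -az --bwlimit=5000 /var/backups/ backup-host:/srv/backups/

# Roughly the same cap with scp (-l is in Kbit/s, so 40000 is about 5 MB/s)
scp -l 40000 backup.tar.gz backup-host:/srv/backups/
```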

Of course, your backup situation is probably fairly complex, but if it comprises invocations of simple tools it might be worth a look in case you can modify the individual invocations in a standard way, at least as a short-term measure.

If the tools you're using don't provide such options, you might find the Linux traffic shaping functionality useful. Even if the server side is something complex (e.g. something running in the cloud), perhaps you could instead throttle on the client side for the same benefit?

Anyway, just some thoughts, in case they're useful.

At the risk of coming off like a PA fanboy, I just want to say thank you for being open. That is something I can't say enough. The alternative bothers me to no end!

So, please, no matter how we grow or how much success we face as a community together, please keep the transparency! It truly does make all the difference.

Thanks, guys!

@Cartroo - ultimately it is an rsync process that's causing the problem, so thanks for the pointers there. The problem is that it looks like it's disk IO that's causing the problem rather than network bandwidth, so while we could slow it down, that might make backups too slow to be timely. It's a tricky balance to strike, and I suspect we need to do some serious sharding to fix the problem in the long run. Still, at least we know where the problem is now...

@a2j - hey, we need all the fanboys we can get - I just need to work hard on my reality distortion field... Anyway, agreed totally re: transparency - we're a small company, and I think the best way we can prove to people that we're worthy of being trusted with their data and their websites is to be completely open when we have problems and make it clear what we're doing to fix them. Secrecy might help Google, Apple and Microsoft in some way, but we're not them. Well, not yet, anyway ;-)

Don't get me wrong. I'm absolutely an unapologetic fan of PA, but my understanding of the term "fanboy" is one with negative connotation...☺

I was able to monitor the performance during the backup this morning. Looks like ionice might help... We'll keep everyone posted.
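For anyone curious, the idea would be something like this (hypothetical paths, not our actual backup command; `-c 3` is the "idle" scheduling class, which only gets disk time when nothing else wants it):

```shell
# Run the backup's rsync at idle IO priority so web requests win any contention
ionice -c 3 rsync -az /var/data/ backup-host:/srv/backups/data/
```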

@a2j - oh, I see! Good point, I know some Apple users who respond to crashes with "oh, I must have done something wrong", they just can't believe that their iPhone/Mac might have a bug.

Just want to add my support for your policy of being frank and honest about problems, I really appreciate that!

Hmm, I haven't come across ionice - looks interesting, I'll have to remember that one.

I guess limiting the network bandwidth might also be a crude way to limit the disk I/O, but at the risk of simply making it less efficient (the same number of I/O transfers, just smaller chunks). I guess you're probably using -z, however, so the relationship between network and disk I/O probably isn't simple.

Depending on exactly what you're doing, I wonder if you could reduce your disk I/O issues on Linux using a tmpfs partition. If you're just cloning then I'm not sure it helps much (although rsyncing to tmpfs and then copying that to disk after probably allows the system to do more efficient disk I/O). However, if you're creating tarballs or similar you could do all that in a tmpfs partition and then just transfer the resultant archive to disk, which should save something.

Of course, tmpfs is volatile, so you'd have to live with a crash losing any backup in progress, and I've no idea whether it's available and/or useful on cloud-based virtualised systems. You'd also want to be able to bound the size of each transfer, so you can size your tmpfs partition appropriately. Might be worth a thought, though.
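As a sketch of the tarball variant (the sizes and paths here are invented, and the mount needs root):

```shell
# Stage the archive in RAM so the compression pass does no disk IO...
mount -t tmpfs -o size=2g tmpfs /mnt/backup-stage
tar czf /mnt/backup-stage/site.tar.gz /var/data

# ...then do one big sequential write to persistent storage
mv /mnt/backup-stage/site.tar.gz /srv/backups/
umount /mnt/backup-stage
```

Since the `mv` crosses filesystems it becomes a copy-then-delete, but that copy is a single sequential write, which disks handle far more gracefully than the scattered small writes of building the archive in place.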