
Flask web app suddenly showing "502 Bad Gateway"

Problem cropped up suddenly on the web server. App runs fine on my local dev server. Please help!

Having the same problem. It seems to be an issue on PythonAnywhere's side, and not just with Flask apps.

We're investigating.

Looks like one of our web servers got overloaded, which took out web apps that were hosted there. We've switched everything over to a backup, and I've double-checked that all web apps belonging to people who've posted on this thread are running.

Apologies for the outage. Tomorrow we'll look into the underlying cause and why we didn't get an automated alert.

Tomorrow? It was already tomorrow when you wrote that...☺

Of course I know what you meant. Be sure to get some sleep now...☺

The problem with monitoring systems is that it's very hard to notice when they stop working. Quis custodiet ipsos custodes?

Indeed... But it really looks like there was a case here that our systems didn't pick up - most web apps were fine, it was just new ones and those that hadn't been hit for a while.

Ahh, nasty. So presumably the monitoring uses a well-known sample service which is kept nice and active by the fact that the monitoring system polls it frequently, and hence it didn't get "idled out" and suffer the issue. Ouch!

Precisely. And (thinking about it) the regular polls from the monitoring service (we use Pingdom) would have kept it awake anyway...

What about having every service you want to ensure is functioning perform regular log entries? Then if the expected log entry doesn't show up, you alert. Or more than likely I'm missing something...☺

Log monitoring is one approach, but it's generally not favoured because of the risk that the logging is working while something else is broken. Generally you want your monitoring to be as close as possible to a user-facing interface so you can catch the widest set of potential issues. In complex systems, it's notoriously hard to predict all the possible failure modes.

Also, web apps are essentially reactive - they don't have any sort of regular timer to use for logging. Daemon-like services could do it.
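For the daemon-like case, the heartbeat idea might be sketched roughly as below. This is only an illustration: the function name, the service names and the five-minute threshold are all made up, and it assumes something else has already pulled the latest heartbeat timestamp for each service out of the logs.

```python
from datetime import datetime, timedelta


def find_stale_services(last_seen, now, max_age=timedelta(minutes=5)):
    """Return the sorted names of services whose latest heartbeat is older than max_age.

    last_seen maps a (hypothetical) service name to the datetime of its
    most recent heartbeat log entry.
    """
    return sorted(name for name, ts in last_seen.items() if now - ts > max_age)


# Example with made-up services: one recent heartbeat, one overdue.
now = datetime(2024, 1, 1, 12, 0, 0)
last_seen = {
    "web-a": now - timedelta(minutes=1),
    "worker-b": now - timedelta(minutes=12),
}
stale = find_stale_services(last_seen, now)  # services to alert on
```

As noted above, a fresh heartbeat only shows that the logging path works, so a check like this would sit alongside user-facing checks rather than replace them.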

So I assume the PA devs have a dummy user account or something that has a web app with a known response that they can check for. But the fact that the monitoring system kept making requests prevented this app from being "swapped out" and it seems this issue only affected swapped out services. This demonstrates the difficulty in predicting failure modes.
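A known-response check of that kind could look something like this. It's a guess at the general shape, not what PythonAnywhere actually runs: the marker string, the function names and the timeout are assumptions, and only the standard library is used.

```python
import urllib.error
import urllib.request


def looks_healthy(status, body, expected_marker):
    """True only for an HTTP 200 whose body contains the known marker text."""
    return status == 200 and expected_marker in body


def probe(url, expected_marker, timeout=10):
    """Fetch url and apply looks_healthy; HTTP and network errors count as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return looks_healthy(resp.status, body, expected_marker)
    except (urllib.error.URLError, OSError):
        # urlopen raises HTTPError (a URLError subclass) for responses like 502.
        return False
```

Note that how often probe() runs matters here: polling the same app every minute is exactly what keeps it from ever being swapped out, so a check like this would only see the idle-app failure if the target had been left alone long enough first.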

In an ideal world I suppose one would monitor real users' apps, but there are many reasons why this is impractical. Still, as long as new monitoring checks are added as new failure modes come to light, the system will become increasingly reliable. Monitoring is something that inevitably has to evolve over time to some extent.

What about having every service you want to ensure is functioning perform regular log entries? Then if the expected log entry doesn't show up, you alert. Or more than likely I'm missing something...☺

We could definitely add some stuff like that, but I don't think it would have picked up any of our non-alerting outages so far.

So I assume the PA devs have a dummy user account or something that has a web app with a known response that they can check for.

We actually use our own blog -- it's a Django app running in a PythonAnywhere hosting account like any other, but we have complete control over what it returns so it's a good proxy for user web apps as a whole -- modulo the whole "never getting swapped out" thing.