
Webapp --- delay in responding

I have noticed that my www.bgtrain.com webapp exhibits a roughly 5-second delay when responding to the first HTTP request. This happens randomly, for about 1/20th of the requests. I have checked my code and found nothing that could cause this. Can you PA guys help here?

Are these requests after the app has been idle for a long time (about a day)? If apps don't get any hits at all then we switch them off and reload them on the next hit, so that would explain what you're seeing. However, if it's 1/20th of the hits regardless of the server's activity then it would have to be something else causing it.

The app is getting about 800 views per day so it is definitely something else.

Could you share the specific URL that's causing the problem? If you want to keep stuff private then use the "Send feedback" link above.

I think it happens on any URL, but most of the time I have observed it on these two:

http://www.bgtrain.com
http://www.bgtrain.com/commentslist

Hmmm. Strange. As far as I can tell from the logs, the delay is happening once we've handed control off from our infrastructure to your code. The machine doesn't seem to have any problems with load. What are those pages doing? For example, are they hitting a sqlite/MySQL database?

Yes, they are hitting the database. A few weeks ago I added some logging to the code that obtains the connection to the database but I saw no irregularities... I also log all queries together with the time they took, but I don't remember seeing anything suspicious there either. Since you are saying that the delay happens in my code I guess I'll have to do some profiling, although the problem does not appear on my test machine.
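
For what it's worth, the query timing is done roughly along these lines (a simplified sketch, not my exact code; the logger name is just illustrative):

    import logging
    import time

    log = logging.getLogger("mysqld")  # logger used for the query timings

    def timed_execute(cursor, sql, args=None):
        # Run a query through the MySQLdb cursor and log the elapsed
        # wall-clock time, so slow queries would show up in the log.
        start = time.time()
        cursor.execute(sql, args)
        log.info("Query execution took %.3f seconds", time.time() - start)
        return cursor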

I checked the database log since January 5th and there are no queries that took suspiciously long to complete... The longest one took 5.8 seconds, but that only happened once. And the query that runs on

www.bgtrain.com/commentslist

is very simple; it's effectively a lookup in a table with about 300 rows.

That is curious, a query like that should certainly be quick.

I've double-checked and you're not using much memory, so it's not likely that part of your app is getting swapped out (which would be separate from our own built-in shutdown of unused apps).

And I've double-checked an app with similar traffic on the same server, and it doesn't show the same problem.

Are your databases on MySQL or sqlite?

It's MySQL.

OK, so it's not filesystem delays. And I've checked the load on the MySQL server associated with your account and that looks fine.

Have you noticed any clustering in the times of the slow loads? I was able to reproduce it several times quite easily this morning (UTC), but just now it took 17 hits to get one slow one.

No, I haven't noticed any clustering... I think it happens more often on the first visit to the website but this is not a rule as you just verified yourself.

This is really odd. Would it be easy for you to add logging of view run times to your code to see if for some reason we're running it more slowly at these times? Given that the problem seems to be limited to specific views, it would probably only be needed on one of those to generate useful data.
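
Something along these lines would do it (just a sketch -- adapt the names to however your views are wired up):

    import functools
    import logging
    import time

    log = logging.getLogger("view_timing")  # illustrative logger name

    def log_view_time(view):
        # Wrap a view function and log its wall-clock run time.
        @functools.wraps(view)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return view(*args, **kwargs)
            finally:
                log.info("%s took %.3f seconds", view.__name__, time.time() - start)
        return wrapper

Decorating just the view for one of the affected pages with @log_view_time should be enough to generate useful data.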

I can do that, but don't you have that information already in your logs?

I'm wondering if our logs are incorrect. I can't see how the numbers could fail to add up, but I could be wrong.

Just to add to that -- our logs might be excluding some time period while the operating system was doing something like loading up your code. They shouldn't, but they might.

Back with some more information... I added logging and it seems that the time is spent in Python code (i.e. not waiting for the database). Look at the following bit from the log, which times a complete handling of my webapp's home page:

INFO 2014-03-10 19:47:52,311 gnubg.webapp: Starting timing of index() call
INFO 2014-03-10 19:48:03,372 gnubg.webapp: Time taken for connecting to db: 0.006
INFO 2014-03-10 19:48:13,114 mysqld: Query execution took 0.002 seconds
INFO 2014-03-10 19:48:13,136 mysqld: Query execution took 0.001 seconds
INFO 2014-03-10 19:48:13,163 mysqld: Query execution took 0.001 seconds
INFO 2014-03-10 19:48:13,213 mysqld: Query execution took 0.002 seconds
INFO 2014-03-10 19:48:13,244 mysqld: Query execution took 0.010 seconds
INFO 2014-03-10 19:48:13,265 mysqld: Query execution took 0.003 seconds
INFO 2014-03-10 19:48:13,285 mysqld: Query execution took 0.001 seconds
INFO 2014-03-10 19:48:13,307 mysqld: Query execution took 0.003 seconds
INFO 2014-03-10 19:48:13,331 mysqld: Query execution took 0.005 seconds
INFO 2014-03-10 19:48:13,362 mysqld: Query execution took 0.011 seconds
INFO 2014-03-10 19:48:13,471 mysqld: Query execution took 0.001 seconds
INFO 2014-03-10 19:48:13,521 mysqld: Query execution took 0.001 seconds
INFO 2014-03-10 19:48:13,529 gnubg.webapp: Time taken for index() call: 10.162

The above is amidst numerous other responses in the range of 0.5 to 1.5 seconds... So there is definitely something wrong here.

Here is some more logging:

INFO 2014-03-10 20:02:19,478 140459967780736 gnubg.webapp: Time taken for index() call: 0.274
INFO 2014-03-10 20:02:41,379 140459967780736 gnubg.webapp: Time taken for index() call: 0.412
INFO 2014-03-10 20:03:06,007 140459967780736 gnubg.webapp: Time taken for index() call: 0.454
INFO 2014-03-10 20:03:32,596 140459967780736 gnubg.webapp: Time taken for index() call: 0.507
INFO 2014-03-10 20:03:58,862 140459967780736 gnubg.webapp: Time taken for index() call: 0.317
INFO 2014-03-10 20:03:59,385 140459967780736 gnubg.webapp: Time taken for index() call: 0.492
INFO 2014-03-10 20:03:59,838 140459967780736 gnubg.webapp: Time taken for index() call: 0.424
INFO 2014-03-10 20:03:59,956 140459967780736 gnubg.webapp: Time taken for index() call: 16.400
INFO 2014-03-10 20:04:00,261 140459967780736 gnubg.webapp: Time taken for index() call: 0.310
INFO 2014-03-10 20:04:00,585 140459967780736 gnubg.webapp: Time taken for index() call: 0.225
INFO 2014-03-10 20:04:05,965 140459967780736 gnubg.webapp: Time taken for index() call: 0.298
INFO 2014-03-10 20:04:25,242 140459967780736 gnubg.webapp: Time taken for index() call: 0.345

Interesting. So it's not the database, but something is running slowly elsewhere in the index() call.

Could you post the code of that method? Or, if you're not willing to post it here publicly, could I take a look? We don't look at anyone's private file storage as a matter of policy, but if you say that I can look here then I'll do so; just tell me which file it's in.

I don't think it's a problem with the code. It happens on all pages; I just elected to put the logging on the index() call. And if you look at the times, you'll see it happens randomly; there is absolutely no reason it should take so much time once in a blue moon.

Just to be a bit clearer, the index() call is made when the root page is requested.

It's possible that something in your code, while having no obvious reason to take a long time under normal circumstances, is interacting weirdly with something specific about PythonAnywhere; that's all I can think of right now that might cause a problem like this. Either way, it's hard to diagnose further without seeing the code.

OK, don't bother then. I will just run full profiling to see what exactly is going on.

OK. If you do find anything weird, I'd love to know what it is.

Sure, I'll post here if I find something.

All I want to confirm at this point is that you guys have not seen this with another webapp, am I correct?

That's right. The only sporadic-delay problem we've seen has been the problem with idle apps being shut down that I mentioned at the start of this thread.

That's not to say that there are no other apps with this problem, of course -- just that no-one else has reported it and we haven't detected it with our monitoring.

Okay, the normal profiler reported roughly 250 milliseconds against the 20-or-so seconds of wall-clock time that I was counting, so I made a custom profiler to count wall-clock time. From its output (which you can see below) it is evident that the culprit is the logging system, specifically the seek() call inside the shouldRollover function. This call is normally fast, but sometimes it takes about 18-25 seconds. I guess it's a glitch in your NFS setup, but please advise.

     12177 function calls (12157 primitive calls) in 24.276 seconds

Ordered by: internal time
List reduced from 236 to 20 due to restriction <20>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   75   24.133    0.322   24.133    0.322 :0(seek)
   12    0.044    0.004    0.044    0.004 :0(query)
    1    0.004    0.004    0.007    0.007 /usr/local/lib/python2.7/dist-packages/MySQLdb/connections.py:62(__init__)
    1    0.004    0.004    0.004    0.004 :0(setprofile)
  189    0.004    0.000    0.016    0.000 /usr/lib/python2.7/logging/__init__.py:451(format)
  189    0.003    0.000    0.003    0.000 :0(localtime)
  189    0.003    0.000    0.008    0.000 /usr/lib/python2.7/logging/__init__.py:404(formatTime)
   12    0.003    0.000    0.011    0.001 :0(fetch_row)
  114    0.003    0.000    0.012    0.000 /usr/lib/python2.7/logging/__init__.py:828(flush)
  228    0.003    0.000    0.004    0.000 /usr/lib/python2.7/threading.py:146(acquire)
  471    0.002    0.000    0.006    0.000 :0(decode)
  228    0.002    0.000    0.004    0.000 /usr/lib/python2.7/threading.py:186(release)
  878    0.002    0.000    0.002    0.000 :0(append)
  471    0.002    0.000    0.008    0.000 /usr/local/lib/python2.7/dist-packages/MySQLdb/connections.py:206(string_decoder)
  471    0.002    0.000    0.004    0.000 /usr/lib/python2.7/encodings/utf_8.py:15(decode)
    1    0.002    0.002   23.397   23.397 leaderboardlib_html:49(render_leaderboard)
   39    0.002    0.000    0.005    0.000 /usr/lib/python2.7/logging/__init__.py:242(__init__)
  114    0.002    0.000    0.028    0.000 /usr/lib/python2.7/logging/__init__.py:839(emit)
  114    0.002    0.000    0.002    0.000 :0(write)
  189    0.002    0.000    0.002    0.000 /usr/lib/python2.7/logging/__init__.py:311(getMessage)


slow_wc.prof% callers :0\(seek
   Ordered by: internal time
   List reduced from 236 to 1 due to restriction <':0\\(seek'>

Function  was called by...
:0(seek)  <- /usr/lib/python2.7/codecs.py:702(seek)(75)   24.134


slow_wc.prof% callers seek
   Ordered by: internal time
   List reduced from 236 to 2 due to restriction <'seek'>

Function                                was called by...
:0(seek)                                <- /usr/lib/python2.7/codecs.py:702(seek)(75)   24.134
/usr/lib/python2.7/codecs.py:702(seek)  <- /usr/lib/python2.7/logging/handlers.py:146(shouldRollover)(75)   24.143
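
For reference, the wall-clock profile above was gathered with something along these lines (a simplified sketch, not my exact wrapper):

    import profile
    import time

    def profile_wallclock(func, *args, **kwargs):
        # The stdlib profile module accepts a custom timer; passing time.time
        # makes it count elapsed wall-clock time instead of CPU time, which
        # is what exposes the slow seek() calls.
        p = profile.Profile(timer=time.time)
        result = p.runcall(func, *args, **kwargs)
        p.dump_stats("slow_wc.prof")  # browse with: python -m pstats slow_wc.prof
        return result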

Those shouldRollover calls appear to be caused by some RotatingFileHandlers that I had configured in my /var/www/blah.py configuration module. I have now removed them and the problem seems to have gone away, but I think you should still look into this issue because I need to use those logging handlers again.
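
In case it helps, the handlers were configured roughly like this (the path and sizes here are illustrative, not my exact values):

    import logging
    import logging.handlers

    handler = logging.handlers.RotatingFileHandler(
        "/home/myuser/logs/webapp.log",  # lives on the shared file storage
        maxBytes=1024 * 1024,            # size-based rollover -> shouldRollover() -> seek()
        backupCount=5,
    )
    handler.setFormatter(logging.Formatter(
        "%(levelname)s %(asctime)s %(name)s: %(message)s"))
    logging.getLogger("gnubg.webapp").addHandler(handler)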

Interestingly, the only seek call in shouldRollover has a comment saying it's there because of a non-POSIX-compliant Windows quirk. It looks like we need to work out what's going on and fix our infrastructure, but since this touches user file storage, an area where we're extra careful, it could be quite a while before we have something workable.

However, there are a few easy things you could try to work around the issue until we have a better solution:

  1. Don't use RotatingFileHandler; use a TimedRotatingFileHandler instead, so you won't be doing file seeks in the middle of your requests.
  2. Create your own version of RotatingFileHandler that excludes the seek line. It could be as simple as copying the class from the Python source, renaming it, deleting that line and using the new class for your logging (there's a rough sketch just below).
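
Here's a rough sketch of option 2 (the class name is made up, and it assumes the stream position is already at the end of the file, which is the case for normal append-mode logging):

    import logging.handlers

    class NoSeekRotatingFileHandler(logging.handlers.RotatingFileHandler):
        # Same as RotatingFileHandler, except that shouldRollover() skips the
        # seek(0, 2) call (the slow one on our file storage) and trusts the
        # current file position instead.
        def shouldRollover(self, record):
            if self.stream is None:  # handler was created with delay=True
                self.stream = self._open()
            if self.maxBytes > 0:
                msg = "%s\n" % self.format(record)
                if self.stream.tell() + len(msg) >= self.maxBytes:
                    return 1
            return 0

For option 1, something like logging.handlers.TimedRotatingFileHandler("/home/myuser/logs/webapp.log", when="midnight", backupCount=7) would rotate once a day without doing any seeks during requests (the path is just an example).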

I think you guys need to look into this right away. Since I disabled the RotatingFileHandlers the problem shows up more rarely, but it's still there, and it slows things down far too much when it happens. Look at this profile run, which I took yesterday with the RotatingFileHandlers disabled:

        8177 function calls (8149 primitive calls) in 25.563 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   25.563   25.563 profile:0(<bound method Application.f_ of <gnubg.webapp.Application instance at 0x23cf128>>)
        1    0.000    0.000   25.549   25.549 webapp.py:91(f_)
       45    0.001    0.000   22.619    0.503 logging/__init__.py:1140(info)
       45    0.001    0.000   22.618    0.503 logging/__init__.py:1250(_log)
       45    0.001    0.000   22.608    0.502 logging/__init__.py:1271(handle)
       45    0.000    0.000   22.607    0.502 logging/__init__.py:1303(callHandlers)
       45    0.001    0.000   22.606    0.502 logging/__init__.py:736(handle)
       45    0.000    0.000   22.599    0.502 logging/__init__.py:931(emit)
       45    0.001    0.000   22.599    0.502 logging/__init__.py:839(emit)
       45    0.001    0.000   22.588    0.502 logging/__init__.py:828(flush)
       45   22.585    0.502   22.585    0.502 :0(flush)
        1    0.000    0.000   15.396   15.396 webapp.py:35(f_)
        1    0.000    0.000   15.396   15.396 webapp.py:192(f_)
        1    0.000    0.000    3.083    3.083 webapp.py:144(f_)
      2/1    0.000    0.000    3.083    3.083 common.py:113(t)

I need some kind of estimate of when this will get fixed, because the current situation is unacceptable.

Unfortunately, we can't commit to a timescale on this because we don't know the cause. The most likely cause is something we're investigating now, but we're not yet sure whether our investigations will lead anywhere useful. It's also possible that the slowdowns are an inevitable consequence of the architecture we use and that there's nothing we can do about them.