Forums

scheduled tasks don't complete

I've scheduled an hourly task. The task runs every hour but doesn't seem to complete - "Currently Running Tasks" shows several of them now (one per hour), waiting to be killed. My app is written in web2py. The task is a bash script which calls two web2py/python scripts.

The python scripts don't ask for input. If I run the bash script from the console, it works perfectly and exits.

Any ideas? thanks

Can you add some debugging output to the task? Maybe a few print statements, so that you can see how far it gets before it gets stuck?

It uses the Python logging module (import logging) to print its messages, but it's not printing anything... let me see if I can put a print at the very beginning. From the interactive bash console, it works flawlessly.

No love. The scripts don't print anything. The jobs seem to be stuck at the very beginning, waiting to be killed.

Would you mind taking a look at my bin/refresh_cache.sh script? It works fine and all debug messages are printed when running from console.

I have also had this problem before, and I still can't figure out what's happening (no log prints, no resource consumption). Any explanations?

I found several instances running on the server. The problem is that the logs only get written when the process exits, and it looks like your script is hanging / never finishes.

You can use the "kill" link next to a running task to stop them, and then see what output they printed. Does that help you to figure out why they are hanging?

Another thing you could try is to set up the logging module to write to a file, e.g. in /tmp. Writes should show up immediately, so you will be able to see anything you log in real time, either from a bash console or via the files tab.
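
For example, a minimal sketch of that, where the filename is just an example:

    import logging

    # Send log output to a file instead of stdout; the logging module flushes
    # each record as it is written, so you can watch the file while the task runs.
    logging.basicConfig(
        filename="/tmp/refresh_cache.log",   # illustrative path
        level=logging.DEBUG,
        format="%(asctime)s %(levelname)s %(message)s",
    )

    logging.debug("task started")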

Now everything seems to be working. Not sure why, but it seems like the processes finish now. I'll report back after the next couple of hourly runs.

I found an "echo hello world" job that I didn't submit; perhaps one of the devs tested something out on my account? Any ideas on what could have changed? thanks for the assist.

So, reporting back. The scheduled tasks are now running just fine. I didn't change anything, so I assume it was something server-side. Thanks!

@rcardenas -- thanks for confirming that they're working now. Harry and I will take a further look on Monday to see if we can pin this down a bit better.

Right, we can't see any problems on the server at the time you reported this. Is it possible that your script was just not exiting?

Thanks for your followup Giles. I don't think so; the script shouldn't take hours to finish.

It just started happening again today. Now I'm getting this in the log:

refresh_cache.sh: started
/home/rcardenas/bin/refresh_cache.sh: fork: retry: No child processes
/home/rcardenas/bin/refresh_cache.sh: fork: retry: No child processes
/home/rcardenas/bin/refresh_cache.sh: fork: retry: No child processes
/home/rcardenas/bin/refresh_cache.sh: fork: retry: No child processes
/home/rcardenas/bin/refresh_cache.sh: fork: Resource temporarily unavailable

Could you take a look? thanks, best regards -Ricardo

If your script is short and doesn't contain anything private, you could post it here for people to take a look at, in case there's some subtle problem we can spot. Make sure it doesn't contain any hard-coded passwords or similar, of course. If you do post it, please indent each line by 4 spaces so that it's formatted nicely (and be careful not to change the indentation of the lines relative to each other in the process).

"Resource temporarily unavailable" is the error you get when you try and start too many processes. We have a ulimit of 32 processes per user on the task server, to prevent fork bombs...

Could your script be stuck in some kind of loop, trying to spawn more and more processes?
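
If it helps, you can check the limit from inside a task with something like this (purely illustrative, and it only reads the limits rather than changing anything):

    import resource

    # Read the per-user process limit that applies to this task.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    print("process limit (soft, hard): %s / %s" % (soft, hard))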

@harry @cartroo - Just came back from 15 days' vacation with no computer access. Thanks for your comments in the meantime.

The Python (web2py) script doesn't spawn processes. It executes a few queries against MySQL and iterates over the results. I've now put debugging comments into my script, and after testing the job for a few cycles, I've identified that the Python script stops after issuing a query to the MySQL database (not even a second after the script starts). It's not the first query in the script - in fact it's the same query that has already run several times, with different values. The script gets stuck at that query and never completes.

I suppose I could add logic so that the script terminates when it encounters this condition. However, I do need all the queries to run.

The strange thing is that the script runs perfectly fine, every time, when run from the command line (in an ssh terminal).

Does that help narrow it down?

P.S. There are no forks, no spawning of extra processes, etc. It's a simple linear script with queries and normal logic.

Hmm, OK. I reckon that the "Resource temporarily unavailable" error was happening because the processes were piling up due to it not exiting, and reached the limit of 32. But that still doesn't explain why they weren't exiting.

I see you gave us permission to take a look at your script back at the start of this thread, so I'll see if there's anything I can see there.

Definitely odd. I've confirmed that although your scheduled task appears to be running, there are no queries running as you on the MySQL server associated with your account.

There are two copies of your scheduled task running, though. And that in itself is a little odd; one was started at 14:59 and one at 15:13; I see you have scheduled tasks set up to start at those times. But I can also see that only the latter is shown in your "Currently running tasks" list.

Is it possible that the first one (started at 14:59) was still running without blocking, and then the other one kicked off, and because they were both running at the same time they got into some kind of deadlock?

@giles Thanks for your analysis.

I doubt the problem is DB deadlocks. (In fact, only today did I separate the work into two hourly tasks to aid debugging. Previously it all ran linearly from one shell script, and exhibited the same behavior.)

Thanks for your offer to check the scripts. Let me explain briefly:

(1) refresh_cache.sh runs hourly and logs to /tmp/refresh_cache.log. (It actually contains some logic to not do anything between 01:00am and 07:00am my local time, nor on weekends.)

(1.1) refresh_cache.sh runs a web2py script called refresh_cache.py which logs to /tmp/seaceperu.log. It does some screen scraping and upserts to MySQL.

(1.2) Then, refresh_cache.sh runs a web2py script called send_change_alerts.py, which queries the database for new content, alerts the users via email, and updates a few records in MySQL.

Step (1) as a whole takes under a minute, sometimes up to two.

(2) The web2py script new_prospects.py also runs hourly. It logs to /tmp/new_prospects.log. (It would normally run from refresh_cache.sh, but I took it out for debugging purposes.) This script queries the database, may send some emails, and may do some MySQL updates. It seems to stall at a different place each time, which appears to rule out a problem with the queries or the data. This script also normally completes within a minute.

Steps (1) and (2) always work fine when run off the command line. But when run from the Schedule tab, the jobs can get stuck at different places; if left unattended, they do indeed start piling up.

(3) There is a daily web2py script called send_nearby_milestones.py which queries the database and sends some emails. I haven't yet seen it give any trouble.

Please feel free to review the code or check the mysql logs. I'll keep debugging to see if it's a problem in the code, but right now it seems to me that something is amiss with the MySQL connection. I doubt deadlocks are the issue because this is all linear. Thanks for your assistance!
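
(For reference, the time-window guard mentioned in (1) boils down to something like the following. The real script is bash, so this Python version is purely illustrative, and it assumes the server clock is in my local timezone, which may not be the case.)

    from datetime import datetime

    # Skip weekends and the 01:00-07:00 window; otherwise carry on with the
    # cache refresh. (Illustrative sketch only.)
    now = datetime.now()
    if now.weekday() >= 5 or 1 <= now.hour < 7:
        raise SystemExit(0)   # nothing to do this hour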

@PA @giles - I've kept debugging and getting strange results. The scripts always complete perfectly when run from the command line. But when I run them from Schedule, they rarely complete.

I've continued adding logging and debugging to my scripts, and I've found that the scripts stop at different and unexpected places, though not far from each other (sorry, difficult to explain :). Sometimes the scripts aren't even doing database work when they stop. For example, just recently a script stopped between two log statements separated by an if! Like so (pseudocode):

log.debug("Print #1")
if i > 0:
    log.debug("Print #2")
else:
    log.debug("Print #3")

The log only shows up to "Print #1" - it does not print #2 or #3. Very strange! How could this be? Is there some kind of CPU limiter or other limiter imposed on Scheduled tasks? My scripts only run for about a minute or two. And I'm always way under my 20,000 CPU-seconds limit.

Is there a more direct channel I might use to interact with a Support rep or Dev? I can't rule out a bug in my scripts, but the behavior I'm seeing is very strange. To emphasize, all scripts work perfectly when run from the ssh command line. thanks -Ricardo

@hansel @giles - I've continued debugging and uncovered something even stranger. On a lark, I decided to add consecutive logging statements inside my main loop, like so:

some code
log.debug("logging again")
log.debug("logging again")
log.debug("logging again")
log.debug("logging again")
some more code

My script is still hanging at random places -- but, believe it or not, once it hung in the middle of these logging statements: the log only shows the first two of them, and then nothing more.

So now I'm at a real loss for explanations. This doesn't seem to be a problem with my code. It seems to be some limitation imposed by (or on) the server that runs the Scheduled Tasks?

@hansel Thanks for the g+ hangout earlier. I added exception debugging, but no exceptions have been logged.

I believe we have a suspect: There is a bug in our scheduled task runner that means that if the process produces too much output, a buffer gets full and the whole thing locks up. Try adding > /tmp/scheduled_task_output 2>&1 to the end of your scheduled task command. You should then see that /tmp/scheduled_task_output contains the output of your process and your task should complete.

Let me know how that goes.
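
(To illustrate the failure mode rather than the task runner itself: a pipe can only buffer a limited amount of data, so if the process on the reading end never drains it, the writer blocks as soon as the buffer is full. A minimal sketch:)

    import os

    # A pipe buffers a limited amount of data (64KB on a typical Linux box).
    read_end, write_end = os.pipe()
    os.write(write_end, b"x" * 65536)   # fills the buffer, but still returns
    # One more byte would now block forever, because nothing is reading
    # read_end; that is the "buffer gets full and it locks up" behaviour:
    # os.write(write_end, b"y")
    os.close(write_end)
    os.close(read_end)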

@glenn @PA

Bingo! I think that bug might be the culprit. I have been testing your suggestion for a few hours and it has been working so far. Will keep you updated in the next few days.

For anyone else reading this thread: Adding @glenn's suggestion to the actual scheduled task command in the web interface didn't quite work for me. (Either the task runner is mangling the extra arguments, or I am.) However, adding @glenn's suggestion to my scripts themselves --thus ensuring that they produce very little or no output to stdout/stderr-- does indeed seem to have worked around this bug.
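
(For anyone who wants to do the same inside a Python script rather than in the bash wrapper, one way is to point the file descriptors at a file early on; a sketch, with an illustrative path:)

    import os
    import sys

    # Redirect stdout and stderr at the OS level, so nothing this script (or
    # any library it imports) prints can fill the task runner's buffer.
    fd = os.open("/tmp/scheduled_task_output",
                 os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    os.dup2(fd, sys.stdout.fileno())
    os.dup2(fd, sys.stderr.fileno())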

Excellent news, thanks for confirming! As I said over email, Glenn has been putting together a fix for the bug and if all looks good over the next few days we'll look at pushing that one through quickly.

Handling output on multiple file descriptors safely, in a way which can't deadlock, is surprisingly subtle. I wrote a Python class for doing this with subprocess a while ago, in case anyone finds it useful (it's mostly proof-of-concept code, so beware the odd bug or two).
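
(That class isn't reproduced here, but the short standard-library version of the same idea looks roughly like this: reading one pipe to the end and then the other can deadlock once the unread pipe fills, whereas communicate() drains both while it waits.)

    import subprocess
    import sys

    # Child writes a lot to both streams; communicate() reads stdout and
    # stderr together, so neither pipe can fill up and block the child.
    proc = subprocess.Popen(
        [sys.executable, "-c",
         "import sys; sys.stdout.write('o' * 100000); sys.stderr.write('e' * 100000)"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate()
    print("stdout: %d bytes, stderr: %d bytes" % (len(out), len(err)))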

If stdout and stderr output continue to be a problem after these changes, it is possible to manually close those filehandles - this is typically done in daemon processes when forking away from the controlling terminal. A simple example of closing stdout:

>>> import os
>>> import sys
>>> print "hello"
hello
>>> print >>sys.stderr, "hello"
hello
>>> os.close(sys.stdout.fileno())
>>> print "hello"
>>> print >>sys.stderr, "hello"
hello

(You should be careful about closing both stdout and stderr in the interactive terminal, because then you will lose your prompt and any output.)

Of course, fixing the underlying issue is preferable, but this might be an easy way for people to work around the problem at the start of their script until a fix is out there. The nice thing is that since you're closing the descriptor at the OS level, there's no way some errant library or hard-coded piece of Python can somehow still produce output. The disadvantage is exactly the same thing: because the descriptor is closed at the OS level, nothing can produce output there at all - and that can make tracking down problems tricky in some cases, even with comprehensive logging.

By the way, sorry for not jumping in earlier - the combination of a week off between jobs and our home PC having blown up isn't good for my forum activity... Writing a message of this length on my iPhone isn't my idea of fun, even with a Bluetooth keyboard! (^_^)

@cartroo - thanks for the enlightening conversation on taking more control over stdout/stderr output.

@giles, @glenn - confirming that my scripts have now been running with the workaround without a hitch for a few days - around 100 runs. We can safely assume that the scheduled task runner bug @glenn suspected was the culprit.

Thanks for tracking the problem down. Have you pushed out the fix live?

Ricardo

No, it's in the pipeline though. Will keep you posted.