multiprocessing

[warning: I am a python newbie]

A script harvests data from an API. Run sequentially, it needs something like 18 hours of wall-clock time for roughly 1:30 of CPU time, spending most of the time waiting for the API to return data.

I am trying to improve the wall-clock time using Python's multiprocessing module.

    import multiprocessing as mp

    [snip]
        while True:
            ## create a pool and parallel-process this batch of requests
            pool = mp.Pool(processes=r)
            results = [pool.apply_async(_processURL, args=(theURLs, x))
                       for x in range(r)]
            try:
                someVar = [p.get() for p in results]
            except Exception:
                pass

            ## close the pool for this iteration and wait for its workers to exit
            pool.close()
            pool.join()
            if (some condition):
                return
    [snip]

    def _processURL(theURLs, x):
        [do stuff]

Works great on my machine. On PA, though, the CPU clock goes through the roof (3:30+). It's not entirely clear to me why the same amount of work is significantly more costly when run through mp. Possibly it's a function of the number of parallel processes: I currently launch requests in batches of 20, which could be stupid on a dual-core machine (which is what I believe the PA setup to be).
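One thing I suspect: the loop above creates and tears down a pool on every iteration, and as far as I understand each mp.Pool has to spawn a fresh set of worker processes, which is pure CPU overhead. Would creating one pool up front and reusing it for every batch, along the lines of this sketch, be the right direction? (harvest and the stand-in bodies are made-up names, not my real code.)

    import multiprocessing as mp

    def _processURL(theURLs, x):
        # stand-in for the real request/parse logic
        return x

    def harvest(theURLs, r):
        # create the pool once; the same r workers handle every batch
        pool = mp.Pool(processes=r)
        try:
            while True:
                results = [pool.apply_async(_processURL, args=(theURLs, x))
                           for x in range(r)]
                someVar = [p.get() for p in results]
                if someVar:  # stand-in for the real stopping condition
                    return someVar
        finally:
            # shut the workers down exactly once, at the end of the job
            pool.close()
            pool.join()

    if __name__ == '__main__':
        # the guard matters on platforms that spawn rather than fork
        print(harvest(['http://example.com'], r=2))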

If anyone has suggestions on how to run mp scripts effectively, please educate me. I've spent quite some time on Stack Overflow and ended up, if anything, more confused -- packages and implementations evolve fast enough that last year's guidelines appear to be obsolete.
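For what it's worth, since the workers spend nearly all their time blocked on network I/O rather than computing, I also wonder whether threads would fit better than processes here: from what I've read, the GIL is released while a thread waits on a socket, and threads avoid the per-process startup and pickling overhead entirely. Something like the sketch below, using multiprocessing.dummy (which I understand exposes the same Pool API backed by threads); fetch_one and all_urls are placeholder names.

    from multiprocessing.dummy import Pool as ThreadPool  # same API, thread-backed
    import urllib.request

    def fetch_one(url):
        # blocking network call; the GIL is released while waiting
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.read()

    def fetch_all(all_urls, workers=20):
        # one pool of threads, reused for the whole job
        pool = ThreadPool(workers)
        try:
            return pool.map(fetch_one, all_urls)
        finally:
            pool.close()
            pool.join()

If that reasoning is off, corrections are very welcome -- 20 mostly-idle threads on a dual-core box should be harmless, whereas 20 processes means 20 interpreters competing for two cores.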