[warning: I am a python newbie]
A script harvests data from an API. It takes something like 18 hours of wall-clock time (about 1:30 of CPU time) to run, spending most of that time waiting for the API to return data.
I am trying to get better wall-clock time using Python's multiprocessing module.
    import multiprocessing as mp

    [snip]

    while True:
        ## create a pool and parallel process the requests
        pool = mp.Pool(processes=r)
        results = [pool.apply_async(_processURL, args=(theURLs, x)) for x in range(0, r)]
        try:
            someVar = [p.get() for p in results]
        except:
            pass
        ## close the pool for this iteration
        pool.close()
        if (some condition):
            return

    [snip]

    def _processURL(theURLs, x):
        [do stuff]
Works great on my machine. On PA, CPU time goes through the roof (3:30+). It is not entirely clear to me why the same amount of work is significantly more costly when run through mp; possibly it is a function of the number of parallel processes. I currently launch requests in batches of 20, which could be stupid on a dual-core machine (which is what I believe the setup at PA to be).
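One thing I'm wondering about trying, since the work is I/O-bound (mostly waiting on the API): threads instead of processes, which would avoid the per-process startup and pickling overhead of mp.Pool. A hypothetical sketch, where `fetch` is just a stand-in for a blocking API request:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(url):
    # stand-in for a blocking API request
    time.sleep(0.1)
    return url.upper()

urls = ["a", "b", "c", "d"]
start = time.time()
# threads overlap the waiting, so 4 requests take ~0.1s, not 0.4s
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, urls))
elapsed = time.time() - start
print(results, elapsed)
```

No idea yet whether this would behave better on a dual-core box, but since the CPU is mostly idle during requests, the GIL should not be the bottleneck here.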
If someone has suggestions on how to run mp scripts effectively, please educate me. (I've spent quite some time on Stack Overflow and ended up more confused, perhaps: packages and implementations are evolving so fast that last year's guidelines appear to be obsolete.)