Scraper performance very unstable

Hi! I am running a Flask application as a backend service for my website. In one of the routes, I use a Selenium-based scraper to obtain some information. Sometimes it takes about 25s to obtain the information (which is fairly reasonable: locally it takes 10s). But at other times it can take almost 3 minutes. This instability makes the feature I am trying to deliver effectively unusable. What could be causing it? Is it possible to make it stable and faster?

Can you do some profiling to figure out which parts of your Selenium code are taking a long time to run? E.g. the startup of the browser vs. getting the site, etc.
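For example, here is a rough sketch of timing each phase separately, using only the standard library. The names timed, timings and profile_scrape are made up for illustration, and the three callables are stand-ins for whatever your scraper actually does (start the browser, load the page, extract the data):

```python
import time
from contextlib import contextmanager

# Hypothetical helper names; nothing here comes from Selenium itself.
timings = {}  # label -> elapsed seconds for the most recent run


@contextmanager
def timed(label):
    """Record how long the wrapped block takes, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


def profile_scrape(start_browser, load_page, extract):
    """Run the three phases of a scrape, timing each one separately."""
    with timed("browser startup"):
        driver = start_browser()
    with timed("page load"):
        load_page(driver)
    with timed("extraction"):
        result = extract(driver)
    return result
```

You would pass in something like load_page=lambda d: d.get(url) and then log the contents of timings on every request, so the slow runs can be compared against the fast ones.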

Depending on how many people you expect to be using your site, it may not make sense to keep the scraper running directly off of the website. If, say, you have 4 workers and you get 4 requests that do scraping, everyone else would be blocked for 25s-3min until those 4 requests are done. And if you get 8 requests that do scraping, everyone else would be blocked for 50s-6min, and so on. Instead, you probably want to offload the scraping to, say, an always-on task, and just have the web request return a "processing" type of message until it is done.
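To illustrate the "return a processing message" idea, here is a minimal sketch using only the standard library. It runs the job in a background thread inside the same process, which is a simplification: an always-on task would be a separate process, so the in-memory jobs dict would have to become something shared, such as a database table. The names submit_job, job_status and jobs are hypothetical, and scrape stands in for your own scraping function:

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

# One worker, so at most one scrape (one browser) runs at a time.
executor = ThreadPoolExecutor(max_workers=1)
jobs = {}  # job_id -> Future; in-memory only, lost on restart


def submit_job(scrape, url):
    """Start scrape(url) in the background and return a job id right away."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = executor.submit(scrape, url)
    return job_id


def job_status(job_id):
    """Return a dict that a web route could send back as JSON."""
    future = jobs.get(job_id)
    if future is None:
        return {"status": "unknown"}
    if not future.done():
        return {"status": "processing"}
    if future.exception() is not None:
        return {"status": "failed", "error": str(future.exception())}
    return {"status": "done", "result": future.result()}
```

One route would call submit_job and immediately return the job id, and a second route would call job_status, which the page can poll every few seconds until the status is no longer "processing".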

Hi, conrad! So, I finally took the time to investigate it further. Apparently, almost 1 minute is spent in this command:

self.driver.get(url)

So it appears to be a problem with loading the page. Sometimes it is relatively fast (12s to load) and everything works fine, but sometimes it can take 4 times as long, and that's when my app stops working properly, since the browser cuts the connection if the API doesn't respond within 30s.

As for offloading the scraping to an always-on task: this is actually not a problem, because this operation is used very rarely. Since it won't be triggered frequently, I don't see it being a burden.

Never mind, everything is working fine now! :)

Glad to hear that!