How to run scrapy with splash on PythonAnywhere

I would like to run Scrapy with Splash on PythonAnywhere. I have successfully installed Scrapy itself.

Is it possible to install Docker on PythonAnywhere? That is how I installed Splash on my own machine, and it is the recommended way to install and run Splash. I haven't been able to find any information about running Docker on PythonAnywhere, so I haven't succeeded in installing Splash that way.

Instead, I tried installing Splash manually, but it doesn't work. The installation itself went fine (pip install splash), but I cannot start Splash. See the error message below.

(scrapy36) 18:58 ~/gds/gds $ python3 -m splash.server
Traceback (most recent call last):
  File "/usr/lib/python3.6/", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/", line 85, in _run_code
    exec(code, run_globals)
  File "/home/robinhat/.virtualenvs/scrapy36/lib/python3.6/site-packages/splash/", line 11, in <module>
    from splash.qtutils import init_qt_app
  File "/home/robinhat/.virtualenvs/scrapy36/lib/python3.6/site-packages/splash/", line 15, in <module>
    from PyQt5.QtWebKit import QWebSettings
ModuleNotFoundError: No module named 'PyQt5.QtWebKit'
(scrapy36) 18:58 ~/gds/gds $

No, unfortunately you cannot run your own Docker images on PythonAnywhere. What does Splash do? My impression of PyQt is that it is a GUI toolkit, which would be meaningless in a headless server environment.

From the Splash documentation:

Splash is a javascript rendering service. It’s a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and QT5. The (twisted) QT reactor is used to make the service fully asynchronous allowing to take advantage of webkit concurrency via QT main loop.

The websites I need to scrape are heavily JavaScript-based, so vanilla Scrapy won't do the job. I need Splash to render the sites into HTML the same way my Chrome browser does.

I am not very experienced at this, so I cannot explain why Splash needs PyQt.
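For context on how Splash is normally used: it exposes its renderer over an HTTP API, so a scraper simply requests a rendered page from the Splash server. A minimal sketch of building such a request URL (the localhost:8050 address is Splash's default port, and the `wait` parameter is the standard "let JavaScript run for N seconds" option; neither comes from this thread):

```python
from urllib.parse import urlencode

def splash_render_url(splash_base, target_url, wait=2.0):
    """Build the URL for Splash's render.html endpoint, which returns
    the page's HTML after JavaScript has executed."""
    params = urlencode({"url": target_url, "wait": wait})
    return "{}/render.html?{}".format(splash_base, params)

# Example, assuming a Splash instance on its default port:
print(splash_render_url("http://localhost:8050", "https://example.com"))
```

Fetching that URL with any HTTP client would return the rendered HTML, which is why a running Splash server (and hence Docker, in the recommended setup) matters.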

You can try using Scrapy with the Firefox that is already installed on PythonAnywhere.

Ok, I gave up on Splash even though it seems really nice. Instead, I tried Selenium with the Firefox driver. The example code worked fine:

from pyvirtualdisplay import Display
from selenium import webdriver

with Display():
    # we can now start Firefox and it will run inside the virtual display
    browser = webdriver.Firefox()

    # put the rest of our selenium code in a try/finally
    # to make sure we always clean up at the end
    try:
        browser.get("https://www.google.com")
        print(browser.title)  # this should print "Google"
    finally:
        browser.quit()


When I run it, I get:

(scrapy36) 19:17 ~/cbb $ python 
(scrapy36) 19:21 ~/cbb $

However, if I change the browser.get line to load the website I would like to scrape instead, it raises an exception:

(scrapy36) 19:21 ~/cbb $ python 
Traceback (most recent call last):
  File "", line 12, in <module>
  File "/home/robinhat/.virtualenvs/scrapy36/lib/python3.6/site-packages/selenium/webdriver/remote/", line 248, in get
    self.execute(Command.GET, {'url': url})
  File "/home/robinhat/.virtualenvs/scrapy36/lib/python3.6/site-packages/selenium/webdriver/remote/", line 234, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/home/robinhat/.virtualenvs/scrapy36/lib/python3.6/site-packages/selenium/webdriver/remote/", line 401, in execute
    return self._request(command_info[0], url, body=data)
  File "/home/robinhat/.virtualenvs/scrapy36/lib/python3.6/site-packages/selenium/webdriver/remote/", line 433, in _request
    resp = self._conn.getresponse()
  File "/usr/lib/python3.6/http/", line 1331, in getresponse
  File "/usr/lib/python3.6/http/", line 297, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.6/http/", line 266, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
(scrapy36) 19:23 ~/cbb $

I can't figure out why this happens on this particular site. Other sites (including https:// sites) work fine. Is it because PythonAnywhere has an old version of Firefox that cannot handle the site? What can I do to solve my problem?

It does look a bit like the browser is crashing, or is not able to respond to Selenium, when it visits that site.
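One way to tell a transient hiccup apart from a reproducible crash: a RemoteDisconnected from Selenium's HTTP client can be either. A small retry wrapper is a quick diagnostic; this is only a sketch, where `fetch` stands in for something like `browser.get` (an assumption for illustration, not code from this thread):

```python
import http.client

def get_with_retry(fetch, url, retries=3):
    """Call fetch(url), retrying when the browser's remote end drops
    the connection. Re-raises after the last attempt, so a
    reproducible crash still surfaces as an error."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except http.client.RemoteDisconnected:
            if attempt == retries - 1:
                raise
```

If every attempt fails on the same site, the browser itself is most likely crashing on that page, rather than the connection being flaky.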

Was there ever a resolution to this topic?

In general, the Scrapy + Selenium solution works fine. I don't think there was a resolution for the problems specific to that site.
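For anyone landing here later, the usual shape of the Scrapy + Selenium combination is: let Selenium render the page, then extract data from the resulting HTML. A minimal sketch of the extraction half using only the standard library (the TitleExtractor class and the sample HTML are illustrative, not from this thread; a real spider would use Scrapy's selectors on `browser.page_source`):

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collect the text of every <h2> element from rendered HTML --
    the kind of extraction a spider would do with CSS/XPath selectors."""
    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.titles.append(data.strip())

# In the real setup this HTML would come from browser.page_source
# after Selenium has executed the page's JavaScript.
html = "<div><h2>First story</h2><p>...</p><h2>Second story</h2></div>"
parser = TitleExtractor()
parser.feed(html)
print(parser.titles)  # -> ['First story', 'Second story']
```

The key point is that once Selenium has done the rendering, the parsing side is ordinary static-HTML scraping.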