Forums

Selenium issue

Hi, I have an issue with Selenium not waiting for a dynamic web page to finish loading, and giving me an error when I try to force it to wait. The relevant section of my code is as follows:

import time
import datetime
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from pyvirtualdisplay import Display


URL = 'https://coronavirus.data.gov.uk/#local-authorities'
datestr = datetime.datetime.now().strftime("%Y%m%d")

#firefox driver for selenium from: https://github.com/mozilla/geckodriver/releases

with Display():
    driver = webdriver.Firefox()
    try:
        driver.get(URL)
        wait = WebDriverWait(driver, 30)
        #wait for the page to load completely
        element = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "/html/body/div/div[2]/div[6]/div/div[3]")))
        time.sleep(10)
    finally:
        page = driver.page_source
        driver.quit()

I get the following error messages:

xdpyinfo was not found, X start can not be checked! Please install xdpyinfo!

Traceback (most recent call last):
  File "/home/markaut/mysite/downloader2.py", line 33, in <module>
    element = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "/html/body/div/div[2]/div[6]/div/div[3]")))
AttributeError: module 'selenium.webdriver.support.expected_conditions' has no attribute 'visibility_of_all_elements_located'

Annoyingly it works fine on my Windows desktop machine. Any suggestions would be very much appreciated.

We could switch you to a new virtualization system that we deployed recently. It works with Chrome in headless mode.

Doing that so it runs headless properly would be very much appreciated.

Would it then solve the 'visibility_of_all_elements_located' problem?

Ok. I have switched your account to use the new virtualisation system. It will start working in new consoles and after you've reloaded your web app.

I think visibility_of_all_elements_located is missing because you're using a different version of Selenium. Make sure that you're using the version of Selenium that the code was written for.
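
For example, you can check which version a script is actually picking up with:

import selenium
print(selenium.__version__)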

Hi Glenn, many thanks for making the change; running headless in Chrome is far better.

On my desktop I'm using Selenium version 3.141.0 and the code works perfectly. I updated PythonAnywhere so it's using the same version of Selenium (selenium.__version__ returns 3.141.0). When I run my code on PythonAnywhere it fails on the line:

element = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "/html/body/div/div[2]/div[6]/div/div[3]")))

now with the error message:

Traceback (most recent call last):
  File "/home/markaut/mysite/downloader3.py", line 34, in <module>
    element = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "/html/body/div/div[2]/div[6]/div/div[3]")))
  File "/home/markaut/.local/lib/python3.8/site-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

So is this a different issue? I'm assuming it's still a selenium problem? Any help would be very much appreciated as I keep tarpitting myself trying random things. Clearly there is a difference somewhere between my desktop environment and the one on PythonAnywhere, but I'm at a loss to find it.

Could you print out the browser body content and try to debug from there?
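
For example, something like this (assuming the driver is still open):

from selenium.webdriver.common.by import By

# print whatever text has actually been rendered into the page body
print(driver.find_element(By.TAG_NAME, "body").text)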

I see that you have a free account -- is the website you are trying to access on our whitelist?

AHA! Some progress.

I'm trying to access https://coronavirus.data.gov.uk/#local-authorities , so I'd have thought it would be OK. Do the scripts it uses also need to be on whitelisted sites?

The browser isn't rendering the HTML; all I get is the JavaScript. Hence my problem.

import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

URL = 'https://coronavirus.data.gov.uk/#local-authorities'

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")

driver = webdriver.Chrome(options=chrome_options)
try:
    driver.get(URL)
    wait = WebDriverWait(driver, 50)
    element = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "/html/body/div/div[2]/div[6]/div/div[3]")))
    time.sleep(1)
finally:
    page = driver.page_source
    driver.quit()

print(page)

Everything on the internet tells me that wait.until(EC.visibility_of_all_elements_located(...)) should wait for the page to render correctly, but it's falling over on this line with the above error.

What is the "above error"?

Traceback (most recent call last):
  File "/home/markaut/mysite/downloader3.py", line 34, in <module>
    element = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "/html/body/div/div[2]/div[6]/div/div[3]")))
  File "/home/markaut/.local/lib/python3.8/site-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

[edit by admin: formatting]

It does look like you're trying to access a non-whitelisted site, so the HTML elements you're looking for won't be there -- the page that comes back will be a "this site isn't on the whitelist" one rather than the page you're expecting.

We can only whitelist sites if they are an official documented public API. If that site has one, then please send us a link to the API documentation, and we can consider adding it to the whitelist.

I'm starting to think a paid account would be a small price to pay if it fixed it! However, I don't get the "this site isn't on the whitelist" page...

I'm looking at a .data.gov.uk site so I think it is whitelisted.

The server returns the basic html, but none of the scripts on it run so the page doesn't render correctly. Below is the first bit of what is returned from print(driver.page_source) :

<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" lang="en"><head><meta charset="utf-8" /><link rel="icon" href="/favicon.ico" /><meta name="viewport" content="width=device-width,initial-scale=1" /><meta name="theme-color" content="#000000" /><meta name="description" content="GOV.UK Coronavirus dashboard" /><link rel="apple-touch-icon" href="/favicon.png" /><link rel="manifest" href="/manifest.json" /><title>Coronavirus (COVID-19) cases in the UK</title><script async="" src="https://www.google-analytics.com/analytics.js"></script><script>!function(e,a,t,n,c,g,o){e.GoogleAnalyticsObject=c,e.ga=e.ga||function(){(e.ga.q=e.ga.q||[]).push(arguments)},e.ga.l=1*new Date,g=a.createElement(t),o=a.getElementsByTagName(t)[0],g.async=1,g.src="https://www.google-analytics.com/analytics.js",o.parentNode.insertBefore(g,o)}(window,document,"script",0,"ga")</script><link href="/static/css/2.7f75bbe2.chunk.css" rel="stylesheet" /><link href="/static/css/main.e7652e3b.chunk.css" rel="stylesheet" /><style data-styled="active" data-styled-version="5.0.1"></style></head>

Hmm, you're right! That site is whitelisted, and I was wrong in my previous comment. And I can see that you're actually getting a page back from the site that you expect, not our whitelist-guarding proxy's error page.

That's weird, though -- you're using Chrome so the JS should run fine. I wonder if it's trying to use scripts from a non-whitelisted site. That would be unusual for a government web page -- they normally have everything in the same place -- but it's not impossible. Do you get any JS errors back if you print driver.get_log("browser")?
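
That is, something along these lines:

# print each entry the browser has written to its console log
for entry in driver.get_log("browser"):
    print(entry)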

OK, I'm well out of my depth now! If I understand the output correctly, I'm getting lots of "connection refused" errors, which would explain it not rendering properly, but I'm none the wiser as to why. Sorry for being clueless here.

The result of print(driver.get_log("browser")) is:

Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/urllib3/connection.py", line 158, in _new_conn
    conn = connection.create_connection(
  File "/usr/lib/python3.8/site-packages/urllib3/util/connection.py", line 80, in create_connection
    raise err
  File "/usr/lib/python3.8/site-packages/urllib3/util/connection.py", line 70, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/urllib3/connectionpool.py", line 597, in urlopen
    httplib_response = self._make_request(conn, method, url,
  File "/usr/lib/python3.8/site-packages/urllib3/connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.8/http/client.py", line 1230, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1276, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1225, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1004, in _send_output
    self.send(msg)
  File "/usr/lib/python3.8/http/client.py", line 944, in send
    self.connect()
  File "/usr/lib/python3.8/site-packages/urllib3/connection.py", line 181, in connect
    conn = self._new_conn()
  File "/usr/lib/python3.8/site-packages/urllib3/connection.py", line 167, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fa40ff058e0>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/markaut/mysite/downloader3.py", line 56, in <module>
    print(driver.get_log("browser"))
  File "/home/markaut/.local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 1262, in get_log
    return self.execute(Command.GET_LOG, {'type': log_type})['value']
  File "/home/markaut/.local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 319, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/home/markaut/.local/lib/python3.8/site-packages/selenium/webdriver/remote/remote_connection.py", line 374, in execute
    return self._request(command_info[0], url, body=data)
  File "/home/markaut/.local/lib/python3.8/site-packages/selenium/webdriver/remote/remote_connection.py", line 397, in _request
    resp = self._conn.request(method, url, body=body, headers=headers)
  File "/usr/lib/python3.8/site-packages/urllib3/request.py", line 70, in request
    return self.request_encode_body(method, url, fields=fields,
  File "/usr/lib/python3.8/site-packages/urllib3/request.py", line 150, in request_encode_body
    return self.urlopen(method, url, **extra_kw)
  File "/usr/lib/python3.8/site-packages/urllib3/poolmanager.py", line 324, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/lib/python3.8/site-packages/urllib3/connectionpool.py", line 663, in urlopen
    return self.urlopen(method, url, body, headers, retries,
  File "/usr/lib/python3.8/site-packages/urllib3/connectionpool.py", line 663, in urlopen
    return self.urlopen(method, url, body, headers, retries,
  File "/usr/lib/python3.8/site-packages/urllib3/connectionpool.py", line 663, in urlopen
    return self.urlopen(method, url, body, headers, retries,
  File "/usr/lib/python3.8/site-packages/urllib3/connectionpool.py", line 637, in urlopen
    retries = retries.increment(method, url, error=e, _pool=self,
  File "/usr/lib/python3.8/site-packages/urllib3/util/retry.py", line 399, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=41185): Max retries exceeded with url: /session/bfc340f5b48fb549eb1836d976a9bf16/log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fa40ff058e0>: Failed to establish a new connection: [Errno 111] Connection refused'))

That's really odd -- I would read that "connection refused" as being Selenium trying to talk to Chrome, but Chrome having exited. Could you post the code that you're using at the moment?

This is the top half of my file. The rest does a bit of Beautiful Soup stuff, but we don't get that far yet.

import time
import datetime
from bs4 import BeautifulSoup
import pandas as pd
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
#from selenium.common.exceptions import NoSuchElementException

URL = 'https://coronavirus.data.gov.uk/#local-authorities'
datestr=datetime.datetime.now().strftime("%Y%m%d")

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
#chrome_options.add_argument("--disable-gpu")

print("selenium version", selenium.__version__)
#with Display():
driver = webdriver.Chrome(options=chrome_options)

try:
    driver.get(URL)

    #wait for the page to load completely         
    #The two lines below commented out to allow debugging. For BS to work they should be un-commented.
    #wait = WebDriverWait(driver, 50)
    #element = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "/html/body/div/div[2]/div[6]/div/div[3]")))

finally:
    page = driver.page_source
    driver.quit()

#print(page)   #dump what is returned from the server.
print('getting driver logs....')
print(driver.get_log("browser"))

#then some beautifulsoup stuff

Sorry my code is a bit amateurish; I know I probably shouldn't be debugging with print. As mentioned above, it works fine on my development box.

If there was an incompatibility between the Chrome driver and Chrome, or it couldn't find the driver, would I see these sorts of error messages?

The problem with that code is that you're quitting the browser before trying to get the log -- you need this line:

print(driver.get_log("browser"))

...to come before this one:

driver.quit()
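
So the end of your script would look something like this:

finally:
    page = driver.page_source
    print('getting driver logs....')
    print(driver.get_log("browser"))  # fetch the log while the browser is still running
    driver.quit()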

GAH! Stupid me!

The log output is much simpler:

[{'level': 'SEVERE', 'message': 'https://coronavirus.data.gov.uk/static/media/light-94a07e06a1-v2.94a07e06.woff2 - Failed to load resource: the server responded with a status of 404 ()', 'source': 'network', 'timestamp': 1587370827407}]

Putting that link into a browser on my dev box causes the woff2 file to download (is it a font file?)

When I run get_log on my local machine I get some warnings indicating that the website is missing a CSS declaration or two, but it still works OK.

Hmm, when I just went to that file from my browser I got this, and a 404 status:

The resource you are looking for has been removed, had its name changed, or is temporarily unavailable.

Perhaps that URL is flaky?

Yeah, I get the same when I try from my Zscaler-protected work machine, but from my other computers it works fine. The link even works on my phone. Is there a filter or something in PythonAnywhere land which would stop this? Could it be a paid vs. free account thing?

Oddly, my work machine can still open the main site 'https://coronavirus.data.gov.uk/#local-authorities', even if it can't open the woff2 file.

I have found a link to the woff2 file; it is in a stylesheet. Is it possible to disable or ignore stylesheet errors with headless Chrome?

Just had a quick search on the internet... could this explain it? https://hotcakescommerce.zendesk.com/hc/en-us/articles/210926903-HTTP-404-Not-Found-Error-with-woff-or-woff2-Font-Files

Hmm, odd. We certainly don't filter stuff like that, and I can confirm that I get the same error when I try to access it from my own machine.

However, if it's just a font then I don't think it can be the cause of the problem. I'd expect the page to be rendered even without the fonts, it would just look a bit rubbish -- which doesn't matter for the purposes of webscraping.

Let's try another thing. If you put the following code in place of the stuff you have to print out the JS logs, it will dump a screenshot into your home directory which you can load up -- perhaps we'll see what's going on then:

driver.get_screenshot_as_file("/home/markaut/screenshot.png")

OK, when I do that I get the cookies banner (which I don't do anything with on my local setup) and the footer. The active content is missing. https://ibb.co/80vzLDx

When I accessed it on my laptop at home, directly from the browser, that was the error that I got.

But you can access this site OK? https://coronavirus.data.gov.uk/#local-authorities is what I'm trying to scrape.

Is the government site broken somehow? Have they put something in there to stop scrapers?

It's annoying as I have code running stably on my dev box, but can't host it.

Do you have code to wait for the active content to load? It may take longer for the content to load than it does for your program to move on to the screenshot.
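
For example (just a sketch -- the XPath needs to match the element you're scraping):

wait = WebDriverWait(driver, 30)
wait.until(EC.presence_of_element_located((By.XPATH, "/html/body/div/div[2]/div[6]/div/div[3]")))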

Yeah, which is where we started.....

wait = WebDriverWait(driver, 50)   #behaves the same even if increased to 150
element = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "/html/body/div/div[2]/div[6]/div/div[3]")))

is supposed to do this, but it fails with the error:

Traceback (most recent call last):
  File "/home/markaut/mysite/downloader3.py", line 32, in <module>
    element = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "/html/body/div/div[2]/div[6]/div/div[3]")))
  File "/home/markaut/.local/lib/python3.8/site-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

It seems the whole page doesn't load, probably because it falls over when it can't get the woff2 file.

The whole code section (line 32 is the element = wait.until(...) line):

driver = webdriver.Chrome(options=chrome_options)

try:
    driver.get(URL)
    wait = WebDriverWait(driver, 50)
    #driver.implicitly_wait(30)
    #wait for the page to load completely
    element = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "/html/body/div/div[2]/div[6]/div/div[3]")))

finally:
    page = driver.page_source
    #print(driver.get_log("browser"))
    driver.get_screenshot_as_file("/home/markaut/screenshot2.png")
    driver.quit()
#print(page)

I think I've tracked down some additional public Azure APIs that the page is using, and whitelisted them. Could you try running your code again now and see if it works?

You absolute genius! Thank you very much. You've cracked it. Thanks to everyone who has helped.

Brilliant! Thanks for confirming :-)

Can you please switch my account to the chrome virtualization as well?

@ssmif -- Sure, but we need to upgrade your system image first. Changing the system image upgrades a lot of the pre-installed Python packages, so any code that you have that relies on those packages might break if it's not compatible with the new versions. Also, because the new image has newer versions of Python, if you have any virtualenvs, you may need to rebuild them. If you're happy to go ahead despite that, just let us know and we'll switch you over.

could I get my account upgraded too please? thanks!!

@ettyb it's activated for you.

Hello, can I have my account upgraded as well?

@MonitorMafia your account already has the most recent system image, and also supports the new virtualization system. What is the issue you're trying to resolve?

@giles "xdpyinfo was not found, X start can not be checked! Please install xdpyinfo!"

when using selenium and pyvirtualdisplay

That's just a warning; you can normally ignore it.

That said, if you're using Chrome then you can use it in headless mode without pyvirtualdisplay: you'll need to upgrade Selenium for your account -- for example, if you're using Python 3.7, run this in Bash:

pip3.7 install --user --upgrade selenium

...and then you can run Selenium with Chrome in headless mode using code like this:

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
browser = webdriver.Chrome(options=chrome_options)

try:
    browser.get("https://www.google.com")
    print("Page title was '{}'".format(browser.title))

finally:
    browser.quit()