
lxml and urllib2 failing to parse a website

Hey. I'm having trouble parsing a website URL. I tried both lxml and urllib2, but both failed. This is the Python code I used for lxml:

@app.route("/getInformation", methods=['GET'])           
def domain():

    urlList = []

    urlList.append("http://gbgfotboll.se/serier/?scr=table&ftid=57109")
    urlList.append("http://gbgfotboll.se/serier/?scr=table&ftid=57108")

    date = '2015-04-18'

    # use this in real mode: currentDate = (time.strftime("%Y-%m-%d"))

    homeScore = "0"
    awayScore = "0"
    homeTeam = ""
    awayTeam = ""

    time_xpath = XPath("td[1]/span/span//text()[2]")
    team_xpath = XPath("td[2]/a/text()")
    league_xpath = XPath("//*[@id='content-primary']/h1//text()")

    for url in urlList:
        test =  0 #remove this
        rows_xpath = XPath("//*[@id='content-primary']/table/tbody/tr[td[1]/span/span//text()='%s']" % (date))
        html = lxml.html.parse(url)

and the corresponding error message:

2015-03-28 12:58:20,903 :Exception on /getInformation [GET]
Traceback (most recent call last):
  File "/home/Timocin/mysite/env/local/lib/python2.7/site-packages/flask/app.py", line 1817, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/Timocin/mysite/env/local/lib/python2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/Timocin/mysite/env/local/lib/python2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/home/Timocin/mysite/env/local/lib/python2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/Timocin/mysite/env/local/lib/python2.7/site-packages/flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/Timocin/mysite/work.py", line 48, in domain
    html = lxml.html.parse(url)
  File "/home/Timocin/mysite/env/local/lib/python2.7/site-packages/lxml/html/__init__.py", line 788, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 3301, in lxml.etree.parse (src/lxml/lxml.etree.c:72453)
  File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:105915)
  File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106214)
  File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105213)
  File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100163)
  File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94286)
  File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:95722)
  File "parser.pxi", line 618, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:94754)
IOError: Error reading file 'http://gbgfotboll.se/serier/?scr=table&ftid=57109': failed to load HTTP resource

And for urllib2 I used this code:

@app.route("/getInformation", methods=['GET'])
def domain():

    urlList = []

    urlList.append("http://c.gbgfotboll.se/serier/?scr=table&ftid=57109")
    urlList.append("http://gbgfotboll.se/serier/?scr=table&ftid=57108")

    date = '2015-04-18'

    # use this in real mode: currentDate = (time.strftime("%Y-%m-%d"))

    homeScore = "0"
    awayScore = "0"
    homeTeam = ""
    awayTeam = ""

    time_xpath = XPath("td[1]/span/span//text()[2]")
    team_xpath = XPath("td[2]/a/text()")
    league_xpath = XPath("//*[@id='content-primary']/h1//text()")

    #hdr = {'User-agent', 'Mozilla/5.0'}

    for url in urlList:
        test =  0 #remove this
        rows_xpath = XPath("//*[@id='content-primary']/table/tbody/tr[td[1]/span/span//text()='%s']" % (date))
        #html = lxml.html.parse(url)
        p = urlopen(url)
        html = parse(p)

With error message:

2015-03-28 15:15:05,087 :Exception on /getInformation [GET]
Traceback (most recent call last):
  File "/home/Timocin/mysite/env/local/lib/python2.7/site-packages/flask/app.py", line 1817, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/Timocin/mysite/env/local/lib/python2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/Timocin/mysite/env/local/lib/python2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/home/Timocin/mysite/env/local/lib/python2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/Timocin/mysite/env/local/lib/python2.7/site-packages/flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/Timocin/mysite/work.py", line 57, in domain
    p = urlopen(url)
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 403: Forbidden

Any ideas?

I recently figured out that you need a paid account in order to access external websites. I upgraded my account and I still get an error message:

For lxml I get:

IOError: Error reading file 'https://gbgfotboll.se/serier/?scr=table&ftid=57109': failed to load external entity "https://gbgfotboll.se/serier/?scr=table&ftid=57109"

For urllib2 I get:

URLError: <urlopen error [Errno 111] Connection refused>

If you're running that code in a web app that you haven't reloaded since you upgraded to a paid account, or if it's in a console that you started before you upgraded, you might get something like that -- might that be the problem?

Well, I did reload the web app. But do I need to create a whole new web app because I upgraded? Not sure what you meant by "if it's in a console".

Don't worry about the console thing, I just wasn't sure if you were doing this from a web app or from a console. Now I look closer at your stack traces, I can see that it's from a Flask web app, so if you reloaded it then that won't be the problem. You definitely don't need to create a new web app.

Hmm. When I try to access https://gbgfotboll.se/serier/?scr=table&ftid=57109 from my web browser, it fails. Are you sure it shouldn't be http://gbgfotboll.se/serier/?scr=table&ftid=57109?

...that is, HTTP instead of HTTPS?
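For anyone hitting the same thing: the two urllib2 errors in this thread mean different things. "HTTP Error 403" means something answered the request and refused it, while "Connection refused" means no HTTP answer came back at all, which is what a wrong scheme or port tends to look like. Here is a minimal sketch for telling them apart. The function name `diagnose` and its `opener` parameter are made up for illustration and are not part of urllib2 or lxml; the imports are written so the same code should run on the Python 2.7 used above and on Python 3.

```python
try:  # Python 3
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError
except ImportError:  # Python 2.7, as used in this thread
    from urllib2 import urlopen, HTTPError, URLError


def diagnose(url, opener=urlopen):
    """Try to fetch url and return a short description of the outcome.

    `opener` is a hypothetical hook so the function can be exercised
    without network access; by default it is the real urlopen.
    """
    try:
        response = opener(url)
    except HTTPError as e:
        # Something answered, but with an error status (e.g. 403).
        # HTTPError is a subclass of URLError, so it must be caught first.
        return "HTTP error %d" % e.code
    except URLError as e:
        # No HTTP answer at all (e.g. connection refused, DNS failure).
        return "connection problem: %s" % e.reason
    response.close()
    return "ok"
```

Calling `diagnose()` once with the `http://` URL and once with the `https://` one from inside the web app should make it obvious which scheme the site actually serves.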

O.M.G. I had it on HTTP first, but because I had a free account it still wasn't working, so I tried everything; I read somewhere that I should try HTTPS, and so on. Now that I have a paid account, I forgot to change it back. The hours I spent on this... Anyway, love you, thanks for the help!

Yay! Glad we got it sorted out finally :-D

Why can't I access this from a free account?

Free accounts have restricted Internet access; they can only access the public documented APIs on our whitelist. Paid accounts can access any site (so long as it allows connections from our servers).