crawler

Hi

I have a Python script that works fine in my local PyCharm IDE. I have now uploaded the script to PythonAnywhere and it does not work there. I created a virtual environment with the necessary packages (BeautifulSoup), and I even uploaded my local venv, but in both cases the script does not behave as it does on my local machine.

The code is a web crawler. Are there any restrictions on this kind of code?

Thanks

You're using a paid account, so it should work fine. What error(s) do you get?

Does it work if you run it from a Bash console?

Jim

Another thing to check if you are using a console instead of the scheduler: after you upgrade to a paying account, you have to start new consoles for the script to work (i.e. your old consoles still have restricted internet access).

I'm using new consoles created after the upgrade.

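This is the code:

from urllib.request import FancyURLopener  # Python 3; on Python 2: from urllib import FancyURLopener

class MyOpener(FancyURLopener):
    # Send a browser-like User-Agent instead of the default Python one
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/3.0.0.11'

mopen = MyOpener()
string = mopen.open("https://boyaca.olx.com.co/nf/all-results").read()

If I print the string created by the last line, I get: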

You don't have permission to access "http://boyaca.olx.com.co/nf/all-results" on this server.<P> Reference #18.b42f9e41.1469549724.391b9d1

That's coming back from the other server -- your code managed to reach it, but the owners blocked the request for some reason.
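One way to confirm this from a console is to look at the HTTP status code rather than only the body. FancyURLopener returns the error page as if it were a normal response, which is why the script printed the "permission" message instead of failing. The following is a minimal sketch for Python 3, not the original poster's code: the helper names fetch_status and looks_blocked are made up for illustration, and the marker text is simply the message quoted above, so other sites will word their refusals differently.

```python
import urllib.error
import urllib.request


def fetch_status(url, user_agent="Mozilla/5.0", timeout=10):
    """Return (status, body). status is None when no HTTP response arrived."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status, body
    except urllib.error.HTTPError as err:
        # The remote server answered, but with an error status such as 403.
        return err.code, err.read().decode("utf-8", errors="replace")
    except OSError:
        # URLError, timeouts and other socket errors are OSError subclasses:
        # the request never got an HTTP response (DNS, proxy, network).
        return None, ""


def looks_blocked(status, body):
    """Heuristic: the server was reached but refused to serve the page."""
    marker = "You don't have permission to access"  # text quoted in this thread
    return status in (401, 403) or marker in body
```

Calling looks_blocked(*fetch_status(url)) on the machine where the script fails and again locally should show whether the difference is a refusal from the remote site (a status plus an error page) or a connection that never got through (status None).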

Yes, but the same code works on my local machine.

When I print the string generated by the last line there, I get the HTML of that URL:

<!doctype html> <html> <head class="layout_head_view" data-view="layout/head" data-subId=""><meta charset="utf-8" /> <title>Todos los resultados en Boyacá</title>

Hmm, I can see a page when I visit it in my own browser too. That suggests that they've blocked access from our IPs or, more probably, from all IPs in the Amazon AWS range. Presumably they don't want to be scraped :-(

Ok,

Thanks for the answer.

Sadly, a lot of sites deliberately block all IPs in the AWS space.

Hi

Can you please tell me whether www.olx.com is still blocking IPs from AWS? I intend to upgrade my account, but I need to know that first.

Thanks

Doing a quick curl from the command line on a paid account gives a webpage with the following heading:

Join the millions who buy and sell from each other everyday in local communities around the world

So I would say that its homepage is not blocked. But there are no guarantees: they may block you if you hammer their site, and they may discourage scraping of certain endpoints other than the homepage.
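To reduce the chance of being blocked, a crawler can respect the site's robots.txt rules and pause between requests. Here is a rough sketch using only the Python 3 standard library. The function names, the two-second delay and the example rules are assumptions for illustration, not anything published by the site discussed here, and following them does not guarantee that a site will keep serving you.

```python
import time
import urllib.robotparser


def build_robots(lines):
    """Parse robots.txt content that is already available as a list of lines."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(lines)
    return parser


def allowed_urls(parser, urls, user_agent="*"):
    """Keep only the URLs that the parsed rules allow for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def crawl_politely(urls, fetch, delay_seconds=2.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = {}
    for index, url in enumerate(urls):
        if index:
            time.sleep(delay_seconds)  # no pause needed before the first request
        results[url] = fetch(url)
    return results
```

In real use the rules would come from the site itself, for example by calling set_url() and read() on a RobotFileParser, and fetch would be whatever function downloads a page in your script.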