Textract package does not work on website

I am trying to host a website on PythonAnywhere that lets me extract 8-digit product numbers from PDF files. The extraction is done with the Textract package.

When I try to extract certain PDF files, I get a UnicodeDecodeError, as you can see here: https://codeshare.io/WdNz6E. The problem is that when I run the website locally on the development server, Textract handles the same PDF files without errors. The Flask (1.1.2) and Textract (1.6.3) versions are the same in both environments. Python is version 3.8.2 locally and 3.8 on PythonAnywhere. Locally, with PyCharm 2020.3.5 on macOS Big Sur 11.4, the Flask app runs flawlessly on the development server.

This is the function used to extract the numbers from the PDF files:

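import re
import textract


def get_numbers(file_path):
    output = ''
    pdf_string = str(textract.process(file_path))
    # Replace every non-digit with a space, then split into runs of digits.
    numbers_list = re.sub(r'\D', ' ', pdf_string).split()
    for x in numbers_list:
        # Keep each 8-digit number only once.
        if len(x) == 8 and x not in output:
            output += f'{x} '
    return output.rstrip()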
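
For reference, the digit-filtering step can be checked independently of Textract. The following is a minimal sketch, assuming the text has already been decoded to a str. The name extract_numbers is a hypothetical helper, not part of the original app, and it uses a regex with lookarounds rather than the split-and-filter loop in get_numbers.

```python
import re


def extract_numbers(text: str) -> str:
    """Return the unique standalone 8-digit numbers in text, space-separated."""
    # Match exactly 8 digits that are not part of a longer run of digits.
    matches = re.findall(r'(?<!\d)\d{8}(?!\d)', text)
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return ' '.join(dict.fromkeys(matches))


result = extract_numbers('a 12345678 b 123456789 12345678 87654321')
```

Testing this part on plain strings makes it easier to tell whether a failure comes from the number filtering or from the PDF text extraction.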

What is the exact error you see in your error log?

You can see the error in the codeshare.io link I provided.

The link doesn't work.

Did you accidentally copy the "." at the end of the link? It works for me. Anyway, here is the error:

2021-06-09 17:00:16,030: Exception on / [POST]
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 1951, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 1820, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "./flask_app.py", line 18, in my_page
    output_data = get_numbers(file_path)
  File "./processing.py", line 7, in get_numbers
    pdf_string = str(textract.process(file_path))
  File "/usr/lib/python3.8/site-packages/textract/parsers/__init__.py", line 77, in process
    return parser.process(filename, encoding, **kwargs)
  File "/usr/lib/python3.8/site-packages/textract/parsers/utils.py", line 47, in process
    unicode_string = self.decode(byte_string)
  File "/usr/lib/python3.8/site-packages/textract/parsers/utils.py", line 65, in decode
    return text.decode(result['encoding'])
  File "/usr/lib/python3.8/encodings/cp1254.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9e in position 9201: character maps to <undefined>
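
The last frames of the traceback show the extracted bytes being decoded with cp1254, a codec in which byte 0x9e has no mapping. The sketch below reproduces that failure on a made-up byte string and shows one possible fallback. The name decode_leniently is a hypothetical helper, not part of Textract, and it assumes you already have the raw bytes in hand. Whether UTF-8 is the right fallback for these particular PDFs is also an assumption.

```python
def decode_leniently(data: bytes, guessed: str = "cp1254") -> str:
    """Try the guessed codec first; fall back to UTF-8 with replacement."""
    try:
        return data.decode(guessed)
    except UnicodeDecodeError:
        # Undecodable bytes become U+FFFD instead of raising.
        return data.decode("utf-8", errors="replace")


# Made-up sample containing the byte from the error message.
sample = b"item \x9e 12345678"

try:
    sample.decode("cp1254")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

text = decode_leniently(sample)
```

Because the product numbers are plain ASCII digits, replacing an undecodable byte elsewhere in the text should not affect them, although that depends on the actual content of the files.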


How are you getting the PDF files? Are they something you're uploading to the filesystem yourself (either through our web interface, or through some kind of post action on your Flask site)? Or are they something that -- say -- you're downloading with requests from some other site?

I am uploading them myself.

It looks like there's some discussion in an issue on the textract GitHub repository that might help you: https://github.com/deanmalmgren/textract/issues/164

I read through the GitHub issue and couldn't find a solution to my problem. I think the real question is why it works on the development server but not on the hosted website. Maybe we should just try resetting my website completely.

Sure, you can try that.