
pypdfocr - How to use it within a script?

Hi,

As you probably guessed, I am a beginner at python.

I was wondering if anybody could help me with the module pypdfocr. I have looked everywhere and I cannot find a way to use this module within a Python script. (The "usage" examples were unfortunately not helpful.)

What I want to achieve is to write a small Python script to OCR a scanned PDF and then analyse the data from the file (this last part is not an issue). Pypdfocr seems to be the perfect module for this, but I cannot manage to make it work.

Any advice or a tiny example would be just great!

Thanks in advance for your help. PythonAnywhere's community rocks!

Unfortunately it looks like pypdfocr just doesn't work on PythonAnywhere right now :-(

The problem is that although it is installed (I think as part of something else) it depends on a number of other packages. Most of these we could install, but (as I just discovered while experimenting in a development version of our system) it depends on pillow, which is an image-processing library that can't be installed side-by-side with the one we currently have, PIL.

I'm not sure what to suggest :-(

Maybe use pdftotext instead? We have that installed, and I think at least one customer has been using it successfully.

Thank you very much for your answer. That was really fast!

Pdftotext would probably do, as well.

You said that it was already installed, but I could not find it in the installed modules list. Where is it?

Actually, is it a module? If not, could you give me a hint on how to call it within a Python script? Do I use subprocess.call?

Thanks again for your help!

That's right, subprocess.call should do the trick, it's a command-line tool. You could try experimenting with it from a bash console first -- that should be a good way to sanity-check that it works against any sample PDFs you have.
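To make that concrete, here is a minimal sketch of calling pdftotext through subprocess.call. The function names and file paths are made up for illustration, and it assumes the pdftotext binary is on the PATH. Keep in mind that pdftotext only extracts text already embedded in a PDF; it does not OCR scanned page images.

```python
import subprocess


def build_pdftotext_command(pdf_path, txt_path):
    # "-layout" asks pdftotext to keep the physical layout of the page.
    # Usage is: pdftotext [options] PDF-file [text-file]
    return ["pdftotext", "-layout", pdf_path, txt_path]


def pdf_to_text(pdf_path, txt_path):
    """Run pdftotext and return the extracted text (hypothetical helper)."""
    # subprocess.call returns the exit status of the command.
    returncode = subprocess.call(build_pdftotext_command(pdf_path, txt_path))
    if returncode != 0:
        raise RuntimeError("pdftotext exited with status %d" % returncode)
    with open(txt_path) as f:
        return f.read()
```

From a console you would then call something like pdf_to_text("sample.pdf", "sample.txt") and analyse the returned string.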

Hi, I realise this is an old thread. I was just curious: since I saw Pillow among the installed modules, does this mean that pypdfocr will work successfully now? Thanks!

I don't think anything substantial has changed since the original thread; we had Pillow installed back then too.

Oh, alright. Thanks!

Hey, I'm going to revive this dinosaur instead of making a new one.

I successfully used pypdfocr from the terminal to convert a PDF file, which leads me to believe that it does work on PythonAnywhere.

However, I am having trouble calling pypdfocr using subprocess.call from within a web app. Here's the error I'm getting, in case it helps:

2016-05-23 21:58:44 Starting conversion of /home/crashoverride947/tmp/test_pdf.pdf#012Using 300 DPI
2016-05-23 21:58:44 /bin/sh: 1: Cannot fork#012#012WARNING: Tesseract-OCR execution failed!#012ERROR: Tesseract-OCR execution failed!
2016-05-23 21:58:45 Traceback (most recent call last):
2016-05-23 21:58:45   File "/home/crashoverride947/.local/bin/pypdfocr", line 11, in <module>
2016-05-23 21:58:45     sys.exit(main())
2016-05-23 21:58:45   File "/home/crashoverride947/.local/lib/python2.7/site-packages/pypdfocr/pypdfocr.py", line 492, in main
2016-05-23 21:58:45     script.go(sys.argv[1:])
2016-05-23 21:58:45   File "/home/crashoverride947/.local/lib/python2.7/site-packages/pypdfocr/pypdfocr.py", line 474, in go
2016-05-23 21:58:45     self._convert_and_file_email(self.pdf_filename)
2016-05-23 21:58:45   File "/home/crashoverride947/.local/lib/python2.7/site-packages/pypdfocr/pypdfocr.py", line 480, in _convert_and_file_email
2016-05-23 21:58:45     ocr_pdffilename = self.run_conversion(pdf_filename)
2016-05-23 21:58:45   File "/home/crashoverride947/.local/lib/python2.7/site-packages/pypdfocr/pypdfocr.py", line 359, in run_conversion
2016-05-23 21:58:45     hocr_filenames = self.ts.make_hocr_from_pnms(preprocess_imagefilenames)
2016-05-23 21:58:45   File "/home/crashoverride947/.local/lib/python2.7/site-packages/pypdfocr/pypdfocr_tesseract.py", line 147, in make_hocr_from_pnms
2016-05-23 21:58:45     pool.join()
2016-05-23 21:58:45   File "/usr/lib/python2.7/multiprocessing/pool.py", line 460, in join
2016-05-23 21:58:45     assert self._state in (CLOSE, TERMINATE)
2016-05-23 21:58:45 AssertionError
2016-05-23 21:58:45

We do not allow subprocess.call in web apps. You could try doing the conversion in real time without subprocess.call, or you could put the job on a queue instead of running it immediately, process it later with our scheduled tasks infrastructure (so not in real time), and pick up the result once it is ready.

Oh, I did not know about that subprocess limitation. By "Real Time", do you mean manually?

I'm trying to add an OCR layer to some scanned PDFs. Doing the whole chain with Python modules has produced nothing but garbage; for some reason pypdfocr is the only tool that produces sensible output, which is why I'm trying the ugly move of calling it through subprocess.call.

Thanks for the quick response!

No, I meant real time as in: you go to a web page, hit a button, and it gives you the result. Non-real-time would be: you upload the file, wait a bit, and get the result later on.
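A rough sketch of that non-real-time approach, assuming Python 3 and a plain directory used as the queue. Everything here (the function names, the ".job" file convention, the queue directory) is invented for illustration and is not a PythonAnywhere API. The web app only calls enqueue, which starts no process; a script run as a scheduled task calls process_queue, where subprocess.call is used.

```python
import json
import os
import subprocess
import uuid


def enqueue(queue_dir, pdf_path):
    """Called from the web app: record a job, start no process."""
    os.makedirs(queue_dir, exist_ok=True)
    job_id = uuid.uuid4().hex
    tmp_path = os.path.join(queue_dir, job_id + ".tmp")
    with open(tmp_path, "w") as f:
        json.dump({"pdf": pdf_path}, f)
    # Rename last so the worker never sees a half-written job file.
    os.rename(tmp_path, os.path.join(queue_dir, job_id + ".job"))
    return job_id


def run_pypdfocr(pdf_path):
    # Assumption: the pypdfocr command takes the PDF path as its argument,
    # as in the terminal run described earlier in the thread.
    return subprocess.call(["pypdfocr", pdf_path])


def process_queue(queue_dir, runner=run_pypdfocr):
    """Called from a scheduled task: run every pending job once."""
    if not os.path.isdir(queue_dir):
        return []
    processed = []
    for name in sorted(os.listdir(queue_dir)):
        if not name.endswith(".job"):
            continue
        job_path = os.path.join(queue_dir, name)
        with open(job_path) as f:
            job = json.load(f)
        status = "done" if runner(job["pdf"]) == 0 else "failed"
        # Mark the job so it is not picked up again on the next run.
        os.rename(job_path, job_path[:-len(".job")] + "." + status)
        processed.append((job["pdf"], status))
    return processed
```

The web app can then tell the user whether the job is finished by checking if a ".done" file exists for the job id that enqueue returned.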