
Subprocess output from pdfimages command

I am trying to extract images from PDF files, and I can do this successfully by running this command in a bash shell:

$ pdfimages -j "google.pdf" conv

However, when I try to run the same command via subprocess within my Flask app:

subprocess.Popen(["pdfimages", "-j", "-", "conv"], stdin=subprocess.PIPE).communicate(input=file.read())

It does not complete the extraction, and gives this error:

Syntax Warning: May not be a PDF file (continuing anyway)
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table

Why is this happening? What is the correct way to start a subprocess on PythonAnywhere?

Is there any reason why you are first loading the file and then passing it to the subprocess's stdin? (Why not just give it the file path, as you did in the bash command?)

I was trying to process the file from memory after upload (so that I don't need to save it on the server), but it doesn't work even if I save it and then pass the file path to the subprocess command.

If I open a bash shell and paste the exact same command that I'm passing inside the subprocess, it works. From inside the flask route it doesn't.

I can implement this workflow locally, but here I am not able to get the subprocess to even run. I tried all kinds of variations (subprocess.Popen, subprocess.run, subprocess.check_output, etc.) and none of them works. There are no errors whatsoever, so can you tell me how to do this? Or am I thread-limited because of the free account? A little explanation would be nice.
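
For the goal of not keeping the upload on the server, one option is to write the bytes to a temporary directory, run pdfimages against that path, and read the results back before the directory is removed. This is only a sketch, assuming pdfimages is on the PATH; build_pdfimages_command and extract_images are hypothetical helper names, not something from this thread. It is also worth checking that the upload stream has not already been consumed (for example by an earlier read or save), since a second read() then returns empty bytes.

```python
import os
import subprocess
import tempfile


def build_pdfimages_command(pdf_path, output_prefix):
    """Return the argument list for `pdfimages -j <pdf> <prefix>`."""
    return ["pdfimages", "-j", pdf_path, output_prefix]


def extract_images(pdf_bytes):
    """Hypothetical helper: run pdfimages on in-memory PDF bytes.

    Returns a dict mapping output file names to their contents.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # Write the upload to an absolute path inside the temp directory.
        pdf_path = os.path.join(workdir, "input.pdf")
        with open(pdf_path, "wb") as fh:
            fh.write(pdf_bytes)

        # Output files are created from this prefix, e.g. conv-000.jpg.
        prefix = os.path.join(workdir, "conv")
        subprocess.run(
            build_pdfimages_command(pdf_path, prefix),
            check=True,
            capture_output=True,
        )

        # Collect the extracted images before the directory is deleted.
        images = {}
        for name in sorted(os.listdir(workdir)):
            if name.startswith("conv"):
                with open(os.path.join(workdir, name), "rb") as fh:
                    images[name] = fh.read()
        return images
```

In a Flask route this could be called as extract_images(file.read()), where file is the uploaded file object from the request.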

Ah, how long does this conversion take? And does it use excessive memory (e.g. 3GB+)? If so, your subprocess may be getting killed. Alternatively, one gotcha could be that you are using a relative file path and the subprocess isn't finding the file.
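
The relative-path gotcha can be ruled out by building an absolute path and logging the working directory, which in a web app may differ from the one in a bash console. A small sketch; absolute_upload_path and the example directory names are hypothetical:

```python
import os


def absolute_upload_path(base_dir, upload_folder, filename):
    """Join the pieces and return an absolute, normalized path."""
    return os.path.abspath(os.path.join(base_dir, upload_folder, filename))


if __name__ == "__main__":
    # Assumption: the app lives in /home/someuser/mysite (illustrative only).
    print("working directory:", os.getcwd())
    print("pdf path:", absolute_upload_path("/home/someuser/mysite", "tmp", "data.pdf"))
```

Passing the absolute path to pdfimages (and an absolute output prefix) means the result no longer depends on where the process was started.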

I would also double-check the versions of all the different libraries you are using to make sure they are the same locally and on PythonAnywhere. Another thing to check is whether your file is just messed up (e.g. the file content somehow changed over the HTTP request).
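
To check whether the uploaded content arrived intact, a rough test is to compare a checksum of the received bytes with one computed locally and to confirm the data starts with the PDF header. A sketch with hypothetical helper names; an empty upload would fail the header check, which would be consistent with the "May not be a PDF file" warning:

```python
import hashlib


def sha256_hex(data):
    """Return the SHA-256 digest of the bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def looks_like_pdf(data):
    """Rough check: PDF files normally begin with the bytes %PDF-."""
    return data.startswith(b"%PDF-")
```

Logging len(data), sha256_hex(data) and looks_like_pdf(data) inside the route, and comparing with the same values for the local file, shows whether the bytes changed in transit.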

Sorry, but that can't be the case. The PDF I am using to extract the images is a simple print of the Google home page. It takes less than a second to run via the bash shell.

Also, I am not using any external libraries, only the Python standard library.

Please check the code below, because it makes no sense to me why it doesn't work:

1. Running via the Python shell:

>>> import subprocess
>>> bashCommand = "pdfimages -j 'tmp/data.pdf' conv"
>>> subprocess.check_output(['bash','-c', bashCommand])
b''

And the images are correctly extracted from the pdf.

2. Making a POST to the Flask app route, where I have this:

    if os.path.isfile(os.path.join(UPLOAD_FOLDER, 'data.pdf')):

        bashCommand = "pdfimages -j 'tmp/data.pdf' conv"
        subprocess.check_output(['bash','-c', bashCommand])

gives the following error:

subprocess.CalledProcessError: Command '['bash', '-c', "pdfimages -j 'tmp/data.pdf' conv"]' returned non-zero exit status 1.

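os.path.join(UPLOAD_FOLDER, 'data.pdf') may be different from 'tmp/data.pdf', because the relative path is resolved against the working directory of the web app rather than that of your shell.

Try: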

bashCommand = "pdfimages -j '%s' conv" % os.path.join(UPLOAD_FOLDER, 'data.pdf')

CalledProcessError also provides the output of running your command, so you can use that to see what the problem was.
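
Note that check_output only captures standard output by default, so the exception's output attribute will not include messages written to standard error. A minimal sketch that captures both streams; run_and_report is a hypothetical helper, and the failing command in the usage example is a stand-in, not pdfimages:

```python
import subprocess
import sys


def run_and_report(args):
    """Run a command and return (returncode, stdout, stderr) as text."""
    try:
        completed = subprocess.run(
            args, check=True, capture_output=True, text=True
        )
        return completed.returncode, completed.stdout, completed.stderr
    except subprocess.CalledProcessError as exc:
        # On a non-zero exit, the captured streams are on the exception.
        return exc.returncode, exc.stdout, exc.stderr


if __name__ == "__main__":
    # Stand-in command that writes to stderr and exits with status 1.
    failing = [sys.executable, "-c",
               "import sys; sys.stderr.write('boom'); sys.exit(1)"]
    code, out, err = run_and_report(failing)
    print("exit status:", code)
    print("stderr:", err)
```

Calling run_and_report(["pdfimages", "-j", pdf_path, "conv"]) from the route and logging the returned stderr should show the actual message behind the "non-zero exit status 1".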