Question

"sh: 1: cannot open /tmp/pdfsandwich4e375e.html: No such file" when using pdfsandwich

rated 0 times [ 1] [ 0] / answers: 1 / hits: 2969 / 2 Years ago, sat, december 25, 2021, 9:53:30

I tried to add a text layer to some PDF files in order to make them searchable. This technique is explained in the German Ubuntu wiki: http://wiki.ubuntuusers.de/pdfsandwich .
After installing the dependencies

sudo apt-get install imagemagick exactimage ghostscript tesseract-ocr

and pdfsandwich itself, it should be as simple as

pdfsandwich test.pdf

However, I get:

Input file: "test.pdf"
Output file: "test_ocr.pdf"
Number of pages in inputfile: 272

Parallel processing with 8 threads started.
Processing page order may differ from original page order.

Processing page 137.
Processing page 171.
Processing page 1.
PProcessing page Processing pProcessing page rocess35.
age 239.
Processing page 69.
205.
ing page 103.
sh: 1: cannot open /tmp/pdfsandwich4e375e.html: No such file

followed by many more "cannot open ..." warnings. Inspecting my /tmp directory shows that the corresponding *.txt files exist instead of these *.html files. Apparently tesseract does not write its output in hocr format. I read the tesseract man page and tried to force hocr output by creating a config file named tesseract-config containing

hocr true

(I also tried several variations of this) and starting pdfsandwich with

pdfsandwich -tesso tesseract-config test.pdf

But this does not seem to change anything. Any ideas how I can make pdfsandwich produce proper output?

Note the related questions "How to add OCRed text to original pdf in gscan2pdf?" and "Adding OCR info to a PDF". However, I need to process many PDF files, so I need a command-line solution that I can automate.

Answers

answered 2 Years ago ionash · Accepted Answer

It turned out that the format of the config file changed with the tesseract version currently shipped in Ubuntu (3.02.01): http://code.google.com/p/tesseract-ocr/source/browse/trunk/tessdata/configs/hocr?r=526 . Tesseract can now be instructed to output in hocr format with a single-line configuration file tesseract-config:

tessedit_create_hocr 1
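For anyone scripting this, the config file can also be written from the shell. This is only a small sketch; it assumes the file should live in the current working directory under the name tesseract-config, as in the question.

```shell
# Write the one-line tesseract config into the current directory.
# The file name "tesseract-config" is the one used in the question.
printf 'tessedit_create_hocr 1\n' > tesseract-config

# Show the result to confirm the file holds exactly that line.
cat tesseract-config
```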

As noted in the question, tesseract can be instructed to read the config file by passing the -tesso option to pdfsandwich:

pdfsandwich -tesso tesseract-config test.pdf
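Since the question asks for something that can be automated over many files, the same call could be wrapped in a loop. The following is a rough sketch, not something taken from the pdfsandwich documentation: the function name ocr_all and the PDFSANDWICH variable are made up for illustration, and it assumes outputs are named like test_ocr.pdf (as in the log above), so that files ending in _ocr.pdf can be skipped on a re-run.

```shell
#!/bin/sh
# Sketch: run pdfsandwich over every PDF in one directory.
# ocr_all and PDFSANDWICH are illustrative names, not part of pdfsandwich.

# Command to run; overridable, defaults to the pdfsandwich binary on PATH.
PDFSANDWICH="${PDFSANDWICH:-pdfsandwich}"

ocr_all() {
    dir="$1"      # directory containing the input PDFs
    config="$2"   # tesseract config file with "tessedit_create_hocr 1"

    for pdf in "$dir"/*.pdf; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$pdf" ] || continue

        # Skip files that look like output from an earlier run.
        case "$pdf" in
            *_ocr.pdf) continue ;;
        esac

        # Same invocation as above; report failures but keep going.
        "$PDFSANDWICH" -tesso "$config" "$pdf" || echo "failed: $pdf" >&2
    done
}
```

It would be called as, for example, ocr_all ./scans tesseract-config, where ./scans is a hypothetical directory name. The config path is passed through unchanged, so it may be safest to run the loop from the directory that contains tesseract-config.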