I tried to add a text layer to some PDF files to make them searchable. This technique is explained in the German Ubuntu wiki: http://wiki.ubuntuusers.de/pdfsandwich
After installing the dependencies
sudo apt-get install imagemagick exactimage ghostscript tesseract-ocr
and pdfsandwich itself, it should be as simple as
pdfsandwich test.pdf
However, I get:
Input file: "test.pdf"
Output file: "test_ocr.pdf"
Number of pages in inputfile: 272
Parallel processing with 8 threads started.
Processing page order may differ from original page order.
Processing page 137.
Processing page 171.
Processing page 1.
PProcessing page Processing pProcessing page rocess35.
age 239.
Processing page 69.
205.
ing page 103.
sh: 1: cannot open /tmp/pdfsandwich4e375e.html: No such file
followed by many more "cannot open ..." warnings. Inspecting my /tmp directory shows that the corresponding *.txt files exist instead of these *.html files. Apparently tesseract is not producing hOCR output. I read the tesseract man page and tried to force hOCR output by creating a config file named tesseract-config containing
hocr true
(I tried several variations of this) and starting pdfsandwich with
pdfsandwich -tesso tesseract-config test.pdf
However, this does not seem to change anything. Any ideas on how I can make pdfsandwich produce proper output?
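To narrow this down, a check along the following lines might show which files tesseract actually writes when asked for hOCR output. This is only a sketch under assumptions: check_hocr_output is a hypothetical helper name, the argument is a single-page image (for example a hypothetical page.tif), and the installed tesseract is assumed to ship the standard hocr config file that can be named as the last argument.

```shell
#!/bin/sh
# Sketch: run tesseract on one page image with the "hocr" config and
# print the names of the output files it creates.
# Assumptions: "$1" is a single-page image, and the tesseract install
# provides the standard "hocr" config.
check_hocr_output() {
    image=$1
    workdir=$(mktemp -d) || return 1
    # TESSERACT may be overridden (e.g. for testing); defaults to the real binary.
    "${TESSERACT:-tesseract}" "$image" "$workdir/out" hocr >/dev/null 2>&1
    # Report whatever was produced, e.g. out.html, out.hocr, or only out.txt.
    for f in "$workdir"/out.*; do
        [ -e "$f" ] && echo "${f##*/}"
    done
    rm -rf "$workdir"
}
```

Calling check_hocr_output page.tif and comparing the printed names with the *.html files that pdfsandwich looks for should show whether the mismatch is in the output format or only in the file extension.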
Note the related questions "How to add OCRed text to original pdf in gscan2pdf?" and "Adding OCR info to a PDF". However, I need to process many PDF files, so I need a command-line solution that I can automate.
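The kind of automation I have in mind is roughly the following loop. It is a minimal sketch, not a tested workflow: ocr_all_pdfs is a hypothetical helper name, and it assumes that pdfsandwich writes its result next to the input as NAME_ocr.pdf, as the output above suggests.

```shell
#!/bin/sh
# Sketch: run pdfsandwich on every PDF in a directory, skipping files that
# are themselves OCR results or that already have one.
# Assumption: pdfsandwich writes NAME_ocr.pdf next to NAME.pdf.
ocr_all_pdfs() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue                  # glob matched nothing
        case $pdf in *_ocr.pdf) continue ;; esac   # output of an earlier run
        [ -e "${pdf%.pdf}_ocr.pdf" ] && continue   # result already exists
        # PDFSANDWICH may be overridden (e.g. for testing).
        "${PDFSANDWICH:-pdfsandwich}" "$pdf" || echo "failed: $pdf" >&2
    done
}
```

Running ocr_all_pdfs /path/to/scans would then process one directory, and failures are reported on stderr without stopping the loop.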