Friday, May 3, 2024
rated 0 times [  1] [ 0]  / answers: 1 / hits: 2969  / 2 Years ago, sat, december 25, 2021, 9:53:30

I tried to add a text layer to some PDF files in order to make them searchable. This technique is explained in the German Ubuntu wiki: http://wiki.ubuntuusers.de/pdfsandwich .
After installing dependencies



sudo apt-get install imagemagick exactimage ghostscript tesseract-ocr


and pdfsandwich itself it should be as simple as



pdfsandwich test.pdf


However, I get:



Input file: "test.pdf"
Output file: "test_ocr.pdf"
Number of pages in inputfile: 272

Parallel processing with 8 threads started.
Processing page order may differ from original page order.

Processing page 137.
Processing page 171.
Processing page 1.
PProcessing page Processing pProcessing page rocess35.
age 239.
Processing page 69.
205.
ing page 103.
sh: 1: cannot open /tmp/pdfsandwich4e375e.html: No such file


followed by many more "cannot open ..." warnings. Inspecting my /tmp directory shows that the corresponding *.txt files exist instead of these *.html files. Apparently tesseract is not producing output in hOCR format. I read the tesseract man pages and tried to force hOCR output by creating a config file named tesseract-config containing



hocr true


(I tried various variations thereof) and starting pdfsandwich with



pdfsandwich -tesso tesseract-config test.pdf


But this does not seem to change anything. Any ideas how I can make pdfsandwich produce proper output?



Note the related questions "How to add OCRed text to original pdf in gscan2pdf?" and "Adding OCR info to a PDF". However, I need to process many PDF files, so I need a command-line solution that I can automate.



 Answers
5

It turned out that the config file format changed with the current Ubuntu version of tesseract (3.02.01): http://code.google.com/p/tesseract-ocr/source/browse/trunk/tessdata/configs/hocr?r=526 . Tesseract can now be instructed to produce hOCR output with a single-line configuration file named tesseract-config:



tessedit_create_hocr 1


As noted in the question, tesseract can be instructed to read the config file by passing the -tesso option to pdfsandwich:



pdfsandwich -tesso tesseract-config test.pdf
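
Since the question asks for something that can be automated over many files, here is a minimal sketch of a batch run. It assumes the PDFs sit in the current directory and that pdfsandwich names its output with the `_ocr.pdf` suffix shown in the log above; the loop and the skip rule are my own additions, not part of pdfsandwich.

```shell
#!/bin/sh
# Write the one-line tesseract config described in this answer.
printf 'tessedit_create_hocr 1\n' > tesseract-config

# Run pdfsandwich on every PDF in the current directory (assumed layout).
for pdf in ./*.pdf; do
    # If no PDF exists, the glob stays unexpanded; skip that case.
    [ -e "$pdf" ] || continue
    # Skip files that look like results of an earlier run.
    case "$pdf" in
        *_ocr.pdf) continue ;;
    esac
    # Report a failure but carry on with the remaining files.
    pdfsandwich -tesso tesseract-config "$pdf" || echo "failed: $pdf" >&2
done
```

Run it from the directory that holds the PDFs. Each input such as test.pdf should end up with a test_ocr.pdf next to it, and failures are reported on stderr without stopping the loop.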

[#30721] Sunday, December 26, 2021
eving