Tuesday, April 30, 2024
rated 0 times [  33] [ 0]  / answers: 1 / hits: 20289  / 2 Years ago, thu, october 27, 2022, 10:46:19

I have a number of scanned documents in PDF format and I want to be able to search them. How can I do that?



Essentially I have to OCR the PDF and then blend the extracted text back into a new PDF. I have unsuccessfully tried a number of different solutions (including the ones found in Adding OCR info to a PDF).




  1. pdfocr (which gives me this issue: https://github.com/gkovacs/pdfocr/issues/7)

  2. pdfsandwich (of which the software center says it is a poor package and I should not install it)

  3. OCRfeeder (in the software center) exports to odt nicely, but does nothing when I export to PDF.

  4. Gscan2pdf exports an all black (but searchable) image as reported in this discussion.

  5. I don't think PDF-XChange Viewer can handle on-the-fly OCR on files over 500 pages.



Is there a software package I am unaware of? Or a script that does this?



 Answers

As of Ubuntu 16.04, OCRmyPDF is available through apt. Just run:


sudo apt install ocrmypdf
ocrmypdf -h # to see the usage

You can then OCR your PDF with the command:


ocrmypdf input.pdf output.pdf  # change input and output to the files you want
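
Since the question mentions a number of scanned documents, the same command can be wrapped in a loop. The following is only a minimal sketch: `ocr_dir` is a hypothetical helper name (not part of OCRmyPDF), and it assumes all scans sit in one directory and that the results should go to a separate directory so the originals are left untouched.

```shell
#!/bin/sh
# ocr_dir is a hypothetical helper, not something OCRmyPDF provides.
# Usage: ocr_dir INPUT_DIR OUTPUT_DIR
ocr_dir() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for src in "$in_dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory holds no PDFs.
        [ -e "$src" ] || continue
        dst="$out_dir/$(basename "$src")"
        # Same invocation as the single-file command: ocrmypdf input output
        if ocrmypdf "$src" "$dst"; then
            echo "done: $src"
        else
            # Report the failure and carry on with the next file.
            echo "failed: $src" >&2
        fi
    done
}
```

It could be called as `ocr_dir ~/scans ~/scans-ocr` (both paths are placeholders). Files that fail are reported on stderr and the loop continues, so one bad scan does not stop the whole batch.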

If the command seems unresponsive, you can increase the verbosity with the -v flag (which can be repeated as -vv or -vvv). It might be best to test the results first on a shorter PDF. You can shorten a PDF as follows:


pdftk A=input.pdf cat A1-5 output output.pdf
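
One way to check the result of such a test run is to see whether any text can be extracted from the OCRed file. This is a rough sketch, not part of the original workflow: `has_text_layer` is a hypothetical helper name, and it assumes `pdftotext` (from the poppler-utils package) is installed.

```shell
#!/bin/sh
# has_text_layer is a hypothetical helper; it assumes pdftotext
# (poppler-utils) is available on the PATH.
# Usage: has_text_layer FILE.pdf
has_text_layer() {
    # "-" sends the extracted text to stdout. grep -q succeeds only if
    # at least one non-whitespace character was extracted.
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

For example, `has_text_layer ocr-test.pdf && echo searchable` prints a message only if extractable text was found in the (placeholder) file `ocr-test.pdf`. It says nothing about OCR accuracy, so it is still worth opening the file and searching for a few words.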

If you have any questions, have a look at the GitHub repo.


[#24925] Friday, October 28, 2022, 2 Years  [reply] [flag answer]