Question

How to turn a pdf into a text searchable pdf?

Asked Thu, October 27, 2022, 10:46:19 / answers: 1

I have a number of scanned documents in pdf and I want to be able to search them. How can I do that?

Essentially I have to OCR the pdf and then merge the extracted text back into a new pdf as a searchable layer. I have unsuccessfully tried a number of different solutions (including the ones found in "Adding OCR info to a PDF"):

pdfocr (which gives me this issue: https://github.com/gkovacs/pdfocr/issues/7)

pdfsandwich (the software center flags it as a poor-quality package and says I should not install it)

OCRfeeder (in the software center) exports to odt nicely, but does nothing when I try to export to pdf.

Gscan2pdf exports an all black (but searchable) image as reported in this discussion.

I don't think PDF-XChange Viewer can handle doing OCR on the fly on files over 500 pages.

Is there a software package I am unaware of? Or a script that does this?

Answers

answered 2 Years ago riffnkful · Accepted Answer

As of Ubuntu 16.04, OCRmyPDF is available through apt. Just run:

sudo apt install ocrmypdf

ocrmypdf -h   # to see the usage

Then you can OCR your pdf with the command:

ocrmypdf input.pdf output.pdf  # change input and output to the files you want
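
Since the question mentions a number of scanned documents, the command above can be wrapped in a loop. What follows is only a sketch: the function name ocr_all and its two directory arguments are made up for illustration, and it assumes the default ocrmypdf options are good enough for your scans.

```shell
#!/usr/bin/env bash
# ocr_all SRC_DIR DEST_DIR   (hypothetical helper, not part of OCRmyPDF)
# Runs ocrmypdf on every *.pdf in SRC_DIR and writes the searchable copy
# under the same file name in DEST_DIR. The originals are left untouched.
ocr_all() {
    local src="$1" dest="$2" f out failed=0
    mkdir -p "$dest" || return 1
    for f in "$src"/*.pdf; do
        [ -e "$f" ] || continue          # the glob matched nothing
        out="$dest/$(basename "$f")"
        if [ -e "$out" ]; then
            echo "skipping (already done): $out"
            continue
        fi
        if ! ocrmypdf "$f" "$out"; then
            echo "failed: $f" >&2
            failed=1                     # remember the failure, keep going
        fi
    done
    return "$failed"
}

# Example call (the paths are placeholders):
# ocr_all ~/scans ~/scans-searchable
```

Because files that already exist in the destination are skipped, the loop can be re-run after an interruption without redoing finished documents.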

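If the command seems unresponsive, you can increase the verbosity with the -v flag (which can be repeated as -vv or -vvv; newer releases of OCRmyPDF may expect a numeric level such as -v 1 instead, so check ocrmypdf -h for your version). OCR on a long document can take a while, so it is best to test the results first on a shorter pdf. You can extract the first five pages of a pdf as follows: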

pdftk A=input.pdf cat A1-5 output output.pdf
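
The two steps (cut a short sample, then OCR it) can be combined into one small helper. This is a sketch under assumptions: the name ocr_sample and the "-sample" file suffixes are invented here, and it relies only on the pdftk and ocrmypdf invocations already shown above.

```shell
#!/usr/bin/env bash
# ocr_sample INPUT.pdf [PAGES]   (hypothetical helper)
# Cuts the first PAGES pages (default 5) out of INPUT.pdf with pdftk,
# then runs ocrmypdf on that short sample so the result can be checked
# before committing to the full document.
ocr_sample() {
    local input="$1" pages="${2:-5}"
    local base="${input%.pdf}"
    local sample="${base}-sample.pdf"
    local result="${base}-sample-ocr.pdf"

    pdftk A="$input" cat "A1-$pages" output "$sample" || return 1
    ocrmypdf "$sample" "$result" || return 1
    echo "check the text layer in: $result"
}

# Example call (the file name is a placeholder):
# ocr_sample input.pdf 5
```

If the sample looks right (try selecting or searching text in a pdf viewer), run ocrmypdf on the complete file.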

If you have any questions, have a look at the GitHub repo.