Friday, May 3, 2024
Asked Wednesday, August 17, 2022, 1:14:11 (1 answer)

The question "How can I turn photos of paper documents into a scanned document?" is related but not the same, as I'm talking about PDF files. The image processing described in the answers there seems complicated, especially because each image has to be processed separately. My PDF has hundreds of pages, so the solution I expect is not to edit images one by one but simply to scan digital photos and documents the way physical ones are scanned. I mean something like a "virtual scanner" whose input would be a photo-based PDF or a collection of photos and whose output would be a "normal" scanned document. (Also, the Scantailor tool recommended there, and also here, seems to lack a Linux version now.)




This is not about OCR and not about converting image to text.


To clarify what I mean I will post a few examples.


Some PDF files are text-based rather than image-based: they are text documents (say, docx or odt) exported to PDF. They look ready to be printed:


enter image description here


The above is not what I discuss here.


What I'm interested in are the pdfs in the images below, namely the difference between scanned text pages that look too much like images and scanned text pages that look like digitized text.


The first kind consists of images that look like photos taken of book pages:


enter image description here


or


enter image description here


Such copies can hardly be re-printed on paper, as the background will be printed too.


The second kind is what one would expect from scanned text, and can be printed:


enter image description here


or


enter image description here


The picture-like pdf may already be OCR-processed and its text searchable, and still look like a collection of (page) photos: OCR is not the problem here.


What I want is the clear black-on-white look of the "scanned" pdf and the removal of all the "real" details (especially shadows) that are normal in a photo but should be absent in a printed page.




As @vanadium noticed in a comment, I am looking for a software solution that automatically cleans up pictures of a document, much like Google Scan on a smartphone.


As @user535733 said in a comment, the problem here seems to be, at least to some extent, that of converting the greyscale (scanned/image) text to black-and-white.
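
To illustrate what "converting greyscale to black-and-white" means in practice, here is a minimal sketch of global thresholding with Otsu's method in plain Python. It is only an assumption about one possible approach, not what Scantailor or any phone scanning app is known to do; the function names `otsu_threshold` and `binarize` are made up for this example. A single global threshold also does not remove uneven shadows well, so real tools typically use more elaborate, locally adaptive methods.

```python
def otsu_threshold(pixels):
    """Return a grey level (0-255) that separates dark pixels from light ones.

    `pixels` is a flat list of 8-bit greyscale values. The level chosen is
    the one that maximises the between-class variance (Otsu's method).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_dark = 0   # number of pixels at or below the candidate level
    sum_dark = 0      # sum of their grey values

    for level in range(256):
        weight_dark += hist[level]
        sum_dark += level * hist[level]
        weight_light = total - weight_dark
        if weight_dark == 0:
            continue          # no dark pixels yet
        if weight_light == 0:
            break             # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Hypothetical sample: three dark "ink" pixels and five light "paper" pixels.
sample = [30, 40, 35, 200, 210, 220, 215, 205]
level = otsu_threshold(sample)
clean = binarize(sample, level)
```

In this sketch the grey "paper" values all end up as pure white and the "ink" values as pure black, which is the look described above for the printable kind of scan.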



 Answers

ScanTailor is no longer maintained, but you can still build it from source and use it.


However, the original repository needs Qt4, which is not easily installable on recent Ubuntu versions. You can use, for example, this fork, which has been adapted to Qt5.


Prerequisites:


sudo apt install libjpeg-dev zlib1g-dev libpng-dev libtiff-dev libboost-dev libxrender-dev libboost-all-dev

Installation:


git clone https://github.com/victl/scantailor
cd scantailor
cmake .
make
sudo make install

Disclaimer: I don't know the maintainer of this fork and cannot vouch for the safety of that version.




Another option is ScanTailor Advanced. You can install it via snap ...


sudo snap install scantailor-advanced

... or flatpak.


... or via ppa.


sudo add-apt-repository ppa:alex-p/scantailor
sudo apt update
sudo apt install scantailor # or scantailor-advanced



Quick test:


enter image description here


Answered Thursday, August 18, 2022