Question


Get printer-ready black text on white background in scanned pdf files (remove grayscale or color background)

Wed, August 17, 2022, 1:14:11 / answers: 1

The question "How can I turn photos of paper documents into a scanned document?" is related, but not the same, as I'm talking about pdf files. The image processing in the answers under the linked question seems complicated, especially because each image has to be processed separately. My pdf has hundreds of pages, so the solution I expect is not processing/editing individual images, but simply scanning digital photos and documents the way real ones are scanned. I mean something like a "virtual scanner" whose input would be a photo-based pdf or a collection of photos and whose output would be a "normal" scanned document. (Also, the Scantailor tool recommended there, and also here, seems to lack a Linux version now.)

This is not about OCR and not about converting image to text.

To clarify what I mean I will post a few examples.

Some pdf files are based on text, not images: they are text documents (say, docx or odt) exported to pdf. They look ready to be printed:

The above is not what I discuss here.

What I'm interested in are the pdfs in the images below, namely the difference between scanned text pages that look too much like images and scanned text pages that look like digitized text.

The first kind consists of images that look like photos of book pages:

or

Such copies can hardly be re-printed on paper, as the background will be printed too.

The second kind is what one would expect from scanned text, and can be printed:

or

The picture-like pdf may already be OCR-processed and its text searchable, and still look like a collection of (page) photos: OCR is not the problem here.

What I want is the clear black-on-white look of the "scanned" pdf and the removal of all the "real" details (especially shadows) that are normal in a photo but should be absent in a printed page.

As @vanadium noted in a comment, I am looking for a software solution that automatically cleans up pictures of a document, much like Google Scan on a smartphone.

As @user535733 said in a comment, the problem here seems to be, at least to some extent, that of converting the greyscale (scanned/image) text to black-and-white.
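To make the greyscale-to-black-and-white idea concrete, here is a minimal sketch of what such a conversion could look like for a whole pdf at once. It assumes pdftoppm (from poppler-utils), ImageMagick's convert (ImageMagick 6; version 7 names the command magick) and img2pdf are installed. The names pdf_to_bw, run and DRY_RUN are made up for this illustration, and 300 dpi and the 60% threshold are only starting values to tune. A single global threshold is the simplest approach and may not cope well with uneven shadows. The result is an image-only pdf, so any existing text layer is lost.

```shell
# Run a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# pdf_to_bw INPUT.pdf OUTPUT.pdf [THRESHOLD]
# Rasterise every page to greyscale, threshold it to pure black and
# white, and reassemble the pages into one pdf.
pdf_to_bw() {
    local in="$1"
    local out="$2"
    local thr="${3:-60%}"   # assumed starting value, tune per document
    local work page status

    work=$(mktemp -d) || return 1

    # One greyscale PGM per page: page-1.pgm, page-2.pgm, ...
    run pdftoppm -r 300 -gray "$in" "$work/page" || { rm -rf "$work"; return 1; }

    for page in "$work"/page-*.pgm; do
        [ -e "$page" ] || continue
        # Pixels brighter than the threshold become white, the rest black.
        run convert "$page" -threshold "$thr" "${page%.pgm}.png" \
            || { rm -rf "$work"; return 1; }
    done

    # Wrap the black-and-white pages into a single pdf.
    run img2pdf "$work"/page-*.png -o "$out"
    status=$?

    rm -rf "$work"
    return "$status"
}

# Usage:
#   pdf_to_bw photos.pdf printable.pdf 55%
```

Setting DRY_RUN=1 before the call only prints the commands, which is handy for checking the chosen values before processing hundreds of pages.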

Answers



answered 2 Years ago herfor · Accepted Answer

Scantailor is no longer maintained, but you can still build it from source and use it.

However, the original repository needs qt4, which is not easily installable in recent Ubuntu versions. You can instead use, e.g., this fork, which has been adapted to qt5.

Prerequisites:

sudo apt install libjpeg-dev zlib1g-dev libpng-dev libtiff-dev libboost-dev libxrender-dev libboost-all-dev

Installation:

git clone https://github.com/victl/scantailor

cd scantailor

cmake .

make

sudo make install

Disclaimer: I don't know the maintainer of this fork, and cannot say anything about the safety of his version.
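The build steps above also rely on git, cmake and make, which the prerequisites line does not install. As a rough sketch, the steps could be wrapped in a function that checks for these tools first and stops at the first failing step. The function names missing_tools and build_scantailor are made up for this example, and the -j option is only an optional speed-up.

```shell
# Print the name of every listed command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Clone and build the fork. Runs in a subshell so the caller's working
# directory is left unchanged.
build_scantailor() (
    missing=$(missing_tools git cmake make)
    if [ -n "$missing" ]; then
        # Unquoted on purpose: joins the names onto one line.
        echo "missing tools:" $missing >&2
        exit 1
    fi

    git clone https://github.com/victl/scantailor &&
    cd scantailor &&
    cmake . &&
    make -j"$(nproc)" &&
    sudo make install
)

# Usage:
#   build_scantailor
```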

Another option would be to use Scantailor advanced. You can install it via snap ...

sudo snap install scantailor-advanced

... or flatpak.

... or via ppa.

sudo add-apt-repository ppa:alex-p/scantailor

sudo apt update

sudo apt install scantailor # or scantailor-advanced

Quick test: