How do I convert a scanned PDF into a PDF with text

Question

I have scanned about 80 pages into a grayscale PDF (image format). The resulting file is about 70 MB, which is very large.

Now I am looking for a method to convert the grayscale image-based PDF file into a simple black/white text-based PDF file.

I have made many attempts with gs, but with no success (only a few percent recovery). If anyone has an idea, kindly let me know.

score 31 · Accepted Answer · edited Dec 18 '17 at 21:40

gImageReader is a simple GTK+ front-end to tesseract-ocr.

sudo apt-get install gimagereader tesseract-ocr

Sorry for the German text.

edited Dec 18 '17 at 21:40

David Foerster

answered Apr 21 '15 at 19:56

A.B.

score 10 · Answer 2 · edited Oct 31 '16 at 21:33

You can try pdfocr:

 sudo add-apt-repository ppa:gezakovacs/pdfocr
 sudo apt-get update
 sudo apt-get install pdfocr

To run it, the syntax is:

 pdfocr -i input.pdf -o output.pdf

where input.pdf is the name of the input file and output.pdf the output file.

By default it uses Tesseract. To install it:

 sudo apt-get install tesseract-ocr

pdfocr creates an embedded text layer.

edited Oct 31 '16 at 21:33

arne.z

answered Feb 18 '16 at 22:50

rafmunozf

score 5 · Answer 3 · answered May 13 '20 at 14:44

Have a look at OCRmyPDF, which works well.
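
For anyone who wants a starting point, here is a minimal sketch of a typical OCRmyPDF run. The package name ocrmypdf, the file names, and the wrapper function ocr_scan are my assumptions for illustration; -l, --deskew and --optimize are options of the ocrmypdf command line, but check ocrmypdf --help on your version before relying on them.

```shell
# Install (assumes your release's repositories ship the ocrmypdf package):
#   sudo apt install ocrmypdf

# ocr_scan INPUT OUTPUT [LANG]
# Hypothetical helper: adds a hidden, searchable text layer to a scanned PDF.
# LANG is a Tesseract language code and defaults to eng (English).
ocr_scan() {
    # --deskew straightens crooked pages before recognition;
    # --optimize 1 applies lossless size optimizations to the images.
    ocrmypdf -l "${3:-eng}" --deskew --optimize 1 "$1" "$2"
}

# Example:
#   ocr_scan scan.pdf scan_ocr.pdf deu
```

The output keeps the scanned page images and adds the recognized text underneath, so the text becomes selectable and searchable.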

answered May 13 '20 at 14:44

aggsol

score 3 · Answer 4 · edited Nov 09 '19 at 21:00

pdfsandwich

It pulls in tesseract and other dependencies on install. It's an easy one-step solution and can be scripted. It can use hocr2pdf to create a plain-text PDF, but that's not ready for prime time... yet. By default it uses tesseract and creates a "sandwiched" PDF: image + text underneath.

The embedded image can be removed with commands like:

gs -o ocr_noIMG.pdf -sDEVICE=pdfwrite -dFILTERIMAGE ocr_image.pdf

but the text is hidden, so it looks like a blank page.

Loading the PDF into LibreOffice Draw exposes the text and the image can be deleted manually.

score 3 · Answer 5 · edited Jul 15 '24 at 11:51

I came across this question whilst looking to convert a scanned PDF to a text-selectable PDF. I later found pdfsandwich, which I have had very good results with, and I am surprised it isn't featured in more detail in the answers so far.

More information is available here: http://www.tobias-elze.de/pdfsandwich/

It uses the Google-sponsored tesseract optical character recognition library behind the scenes but simplifies the PDF processing and creation steps.

As of December 2020, it is included in the official Ubuntu repositories. To install:

sudo apt update && sudo apt install pdfsandwich

To process a PDF called input.pdf:

pdfsandwich input.pdf

By default, your output will appear as something like input_ocr.pdf.
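
If you have a whole directory of scans, the single command can be wrapped in a loop. This is only a sketch: the helper name sandwich_all is made up, and it assumes pdfsandwich writes its *_ocr.pdf output next to the input file, which is what I would expect from the default naming but is worth verifying on your system.

```shell
# sandwich_all DIR
# Hypothetical helper: runs pdfsandwich on every PDF in DIR, skipping
# files that already look like pdfsandwich output (*_ocr.pdf) and inputs
# whose *_ocr.pdf counterpart already exists.
sandwich_all() {
    for pdf in "$1"/*.pdf; do
        [ -e "$pdf" ] || continue                  # glob matched nothing
        case $pdf in
            *_ocr.pdf) continue ;;                 # already an OCR result
        esac
        [ -e "${pdf%.pdf}_ocr.pdf" ] && continue   # already processed
        pdfsandwich "$pdf" || echo "failed: $pdf" >&2
    done
}

# Example:
#   sandwich_all ~/scans
```

Because finished files are skipped, the loop can be re-run after an interruption without redoing pages that were already recognized.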

On Ubuntu 20.04, it didn't work initially due to a Ghostscript permissions issue. This can be worked around by commenting out the following lines with XML comments (<!-- ... -->) in /etc/ImageMagick-6/policy.xml (in my file, these were lines 90 - 95):

  <policy domain="coder" rights="none" pattern="PS" />
  <policy domain="coder" rights="none" pattern="PS2" />
  <policy domain="coder" rights="none" pattern="PS3" />
  <policy domain="coder" rights="none" pattern="EPS" />
  <policy domain="coder" rights="none" pattern="PDF" />
  <policy domain="coder" rights="none" pattern="XPS" />

Reference for this fix: https://www.itechlounge.net/2020/09/web-imagickexception-attempt-to-perform-an-operation-not-allowed-by-the-security-policy-pdf/
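
The commenting-out step can also be scripted. The sketch below assumes the policy lines in your file look exactly like the six quoted above; the helper name relax_ps_pdf_policy is hypothetical. It keeps a backup next to the file it edits.

```shell
# relax_ps_pdf_policy FILE
# Hypothetical helper: wraps the PS/PS2/PS3/EPS/PDF/XPS coder policy lines
# in XML comments and keeps a backup as FILE.bak. Lines that are already
# commented out do not match the pattern and are left alone.
relax_ps_pdf_policy() {
    sed -E -i.bak \
        's#^([[:space:]]*)(<policy domain="coder" rights="none" pattern="(PS|PS2|PS3|EPS|PDF|XPS)" />)[[:space:]]*$#\1<!-- \2 -->#' \
        "$1"
}

# Example: edit a copy, inspect the change, then install it.
#   cp /etc/ImageMagick-6/policy.xml /tmp/policy.xml
#   relax_ps_pdf_policy /tmp/policy.xml
#   diff /etc/ImageMagick-6/policy.xml /tmp/policy.xml
#   sudo cp /tmp/policy.xml /etc/ImageMagick-6/policy.xml
```

Working on a copy first makes it easy to confirm that only the intended six lines changed before touching the system-wide file.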

To read the documentation:

man pdfsandwich

If your document is not English, you can add the -lang switch like this:

pdfsandwich -lang xxx input.pdf

where xxx is the Tesseract code for the language of the document you need to convert (for example, deu for German). If support for that language is not already installed, install it using:

sudo apt install tesseract-ocr-xxx
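
To find out whether a language is already installed, you can ask Tesseract directly. A small sketch, with has_tesseract_lang being a made-up helper name and deu used purely as an example code:

```shell
# has_tesseract_lang CODE
# Hypothetical helper: succeeds if Tesseract lists CODE (for example "deu")
# among its installed languages. Some Tesseract versions print the list on
# stderr, hence the 2>&1.
has_tesseract_lang() {
    tesseract --list-langs 2>&1 | grep -Fqx -- "$1"
}

# Example:
#   has_tesseract_lang deu || sudo apt install tesseract-ocr-deu
#   pdfsandwich -lang deu input.pdf
```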

score 2 · Answer 6 · answered Apr 11 '19 at 12:09

You could try shrinkpdf to reduce the filesize and then ocr.sh to add the text layer.
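
As far as I know, shrinkpdf is a wrapper around Ghostscript, so the size-reduction step can be approximated with gs directly. The sketch below is not shrinkpdf itself; the helper name shrink_pdf and the choice of the ebook preset are my assumptions.

```shell
# shrink_pdf INPUT OUTPUT [PRESET]
# Hypothetical helper: rewrites a PDF with Ghostscript's pdfwrite device,
# downsampling images according to PRESET (screen, ebook, printer or
# prepress; defaults to ebook).
shrink_pdf() {
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
       -dCompatibilityLevel=1.4 \
       -dPDFSETTINGS="/${3:-ebook}" \
       -sOutputFile="$2" "$1"
}

# Example:
#   shrink_pdf scan.pdf scan_small.pdf
```

Keep in mind that lower image resolution can hurt recognition quality, so it may be worth comparing the OCR result on the original and on the shrunken file.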

answered Apr 11 '19 at 12:09

student

score 1 · Answer 7 · answered Feb 18 '16 at 20:41

For the graphical interface suggested by @A.B., on Ubuntu 14.04 you should follow:

ocr tesseract on ubuntu 14.04

or, in any case, add the PPA to the repository list:

sudo add-apt-repository ppa:sandromani/gimagereader
sudo apt-get update

before this works:

sudo apt-get install gimagereader

score 0 · Answer 8 · answered Apr 10 '21 at 19:14

Actually, the best I've found is the pdftotext command, which is part of poppler-utils:

sudo apt install poppler-utils

It's pretty slick and simple: if you run pdftotext -layout xxx.pdf, you even get the original layout preserved as text.
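
Note that pdftotext reads text that is already stored in the PDF; it does not do OCR itself. That makes it a handy check of whether a file has a text layer, for example after running one of the OCR tools from the other answers. A minimal sketch, with count_pdf_words being a made-up helper name:

```shell
# count_pdf_words FILE
# Hypothetical helper: prints the number of words pdftotext can extract.
# A result of 0 suggests the PDF has no text layer (an image-only scan).
count_pdf_words() {
    # "-" sends the extracted text to stdout instead of a .txt file;
    # tr strips the padding some wc implementations add.
    pdftotext -layout "$1" - | wc -w | tr -d '[:space:]'
}

# Example:
#   if [ "$(count_pdf_words scan.pdf)" -eq 0 ]; then
#       echo "no text layer found, run OCR first"
#   fi
```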

answered Apr 10 '21 at 19:14

Vlax
