Adding OCR info to a PDF

Question

I have a good-quality scan of a document in PDF format.

How can I add OCR information to the PDF so that it becomes searchable? By searchable I mean that when I view the PDF in Evince, Ctrl+F actually lets me search the PDF's content.

score 25 · Answer 1 · edited Mar 10 '17 at 04:03

pdfsandwich

pdfsandwich does what you want and provides Ubuntu deb packages. It uses Tesseract as its OCR engine. The following call adds the text layer to your scanned PDF:

pdfsandwich scanned.pdf

The following does the same for another language (given as an ISO 639-2 code; install the matching tesseract-ocr-LANGCODE package) and also sets the layout:

pdfsandwich -verbose -lang spa -layout single scanned.pdf

If you get an error, please download the latest deb package from SourceForge.
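A missing language pack is one likely cause of errors with -lang. The sketch below checks for it first, assuming Tesseract's --list-langs option is available in your version; the function names lang_installed and ocr_with_lang are made up for illustration and are not part of pdfsandwich.

```shell
#!/bin/sh
# Sketch: confirm a Tesseract language pack is installed before running pdfsandwich.

# lang_installed LANGCODE
# Succeeds if LANGCODE appears as a whole line in `tesseract --list-langs`.
# Older Tesseract releases print the list on stderr, so both streams are read.
lang_installed() {
    tesseract --list-langs 2>&1 | grep -qx -- "$1"
}

# ocr_with_lang LANGCODE FILE   (hypothetical helper)
# Runs pdfsandwich only when the language data is present.
ocr_with_lang() {
    if lang_installed "$1"; then
        pdfsandwich -lang "$1" "$2"
    else
        echo "Missing language data; try: sudo apt-get install tesseract-ocr-$1" >&2
        return 1
    fi
}
```

After sourcing this, `ocr_with_lang spa scanned.pdf` either runs pdfsandwich or prints which package to install.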

Disclaimer: I'm the developer of pdfsandwich and therefore obviously biased.

score 9 · Answer 2 · edited Feb 19 '13 at 10:02

Two projects do the trick: GScan2PDF and OCRFeeder.

answered Jun 07 '12 at 21:24 by Aldi · edited Feb 19 '13 at 10:02 by Ashwin Nanjappa

score 8 · Answer 3 · edited Feb 28 '23 at 07:52

An easy-to-use solution that produces an output PDF of the same quality as the input file, at a reasonable size, is OCRmyPDF:

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.

# The options sit in a bash array so that each one can carry its own comment.
ocrmypdf_args=(
   -l eng+fra                 # it supports multiple languages
   --rotate-pages             # it can fix pages that are misrotated
   --deskew                   # it can deskew crooked PDFs!
   --title "My PDF"           # it can change output metadata
   --jobs 4                   # it uses multiple cores by default
   --output-type pdfa         # it produces PDF/A by default
   input_scanned.pdf          # takes PDF input (or images)
   output_searchable.pdf      # produces validated PDF output
)
ocrmypdf "${ocrmypdf_args[@]}"   # it's a scriptable command line program
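To confirm that the output really is searchable without opening a viewer, one option is to extract its text on the command line. This is a minimal sketch, assuming pdftotext from poppler-utils is installed; that tool and the function names has_text_layer and pdf_contains are additions for illustration, not part of OCRmyPDF.

```shell
#!/bin/sh
# Sketch: check for an extractable text layer using pdftotext (poppler-utils).

# has_text_layer FILE
# Succeeds if pdftotext extracts at least one letter or digit from FILE.
has_text_layer() {
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# pdf_contains FILE WORD
# Succeeds if WORD occurs in the extracted text (case-insensitive, literal match).
pdf_contains() {
    pdftotext "$1" - 2>/dev/null | grep -qiF -- "$2"
}
```

For example, `has_text_layer output_searchable.pdf && echo searchable` should print "searchable" after a successful OCR run, while an image-only scan typically yields no extractable text.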

score 4 · Answer 4 · answered Feb 19 '13 at 10:31

I found a non-ideal solution, but a very effective one.

I use PDF-XChange Viewer through Wine. It has an OCR feature which adds a text layer to the existing image-based PDF.

Thus you can search and copy text from this invisible layer.

answered Feb 19 '13 at 10:31 by To Do

score 2 · Answer 5 · edited Mar 15 '21 at 05:50

I use ocrmypdf and it just works fine.

ocrmypdf input.pdf output.pdf --force-ocr

On a Raspberry Pi, I created a shell script that converts all the PDF files in the current folder, overwriting each file in place. Its content:

for i in *.pdf; do ocrmypdf "$i" "$i" --force-ocr; done

I call it by executing bash convertToSearchablePDF.sh in the terminal.
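If you would rather keep the original scans, a variant of the loop can write the results into a separate folder and report which files failed. This is only a sketch: the function name ocr_folder and the directory layout are made up for illustration, and the ocrmypdf call is the same as above.

```shell
#!/bin/sh
# Sketch: OCR every PDF in one folder into another, leaving the originals untouched.

# ocr_folder SRC_DIR DEST_DIR   (hypothetical helper)
# Returns non-zero if any file failed.
ocr_folder() {
    src=$1
    dest=$2
    failed=0
    mkdir -p "$dest" || return 1
    for pdf in "$src"/*.pdf; do
        [ -e "$pdf" ] || continue            # the glob matched nothing
        out="$dest/$(basename "$pdf")"
        if ocrmypdf "$pdf" "$out" --force-ocr; then
            echo "ok: $pdf"
        else
            echo "failed: $pdf" >&2
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}
```

Called as `ocr_folder . ./searchable`, it would place the OCRed copies in ./searchable.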

answered Mar 14 '21 at 15:20 by Andru · edited Mar 15 '21 at 05:50 by muru

score 2 · Answer 6 · answered Mar 23 '14 at 20:23

For a command line solution, you can use pdfocr.

In brief, install the software:

$ sudo apt-get install python-software-properties
$ sudo add-apt-repository ppa:gezakovacs/pdfocr
$ sudo apt-get update
$ sudo apt-get install pdfocr

Then run pdfocr:

$ pdfocr -i scanned.pdf -o scanned.with.search.pdf

That worked for me on Ubuntu 12.04 LTS.

score 1 · Answer 7 · answered Jul 07 '21 at 20:32

I needed to remove a bad OCR and reduce the size of my PDF as well; I came up with the following script using ocrmypdf and ghostscript.

#!/usr/bin/sh
TEMP_FILE="$(mktemp --suffix=.pdf)" &&
    ghostscript -q -dNOPAUSE -dBATCH -dSAFER -dPDFA=2 -dPDFACompatibilityPolicy=1 -dSimulateOverprint=true -sDEVICE=pdfwrite -dCompatibilityLevel=1.3 -dPDFSETTINGS=/ebook -dAutoRotatePages=/None -dColorImageDownsampleType=/Bicubic -dColorImageResolution=150 -dGrayImageDownsampleType=/Bicubic -dFILTERTEXT -dImageResolution=300 -sOutputFile="$TEMP_FILE" "$1" &&
    ocrmypdf "$TEMP_FILE" "$2" &&
    rm "$TEMP_FILE"

The long ghostscript line removes the text layer and makes various space-saving changes to the first argument $1, saving them to a temporary file. We then add OCR with ocrmypdf (which is an excellent tool), and output to the path given by the second argument $2.
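One thing to be aware of: with the && chain, the temporary file is left behind if ghostscript or ocrmypdf fails. A possible variant that always cleans up is sketched below. It uses only a subset of the ghostscript options from the script, and the function name strip_then_ocr and the file name stripped.pdf are made up for illustration.

```shell
#!/bin/sh
# Sketch: strip the old text layer, re-OCR, and always remove the temporary files.

# strip_then_ocr INPUT OUTPUT   (hypothetical helper)
# Returns the exit status of the failing step, or 0 on success.
strip_then_ocr() {
    workdir=$(mktemp -d) || return 1
    tmp="$workdir/stripped.pdf"
    # -dFILTERTEXT drops the existing text layer; /ebook downsamples images.
    ghostscript -q -dNOPAUSE -dBATCH -dSAFER -sDEVICE=pdfwrite \
        -dPDFSETTINGS=/ebook -dFILTERTEXT -sOutputFile="$tmp" "$1" &&
        ocrmypdf "$tmp" "$2"
    status=$?
    rm -rf "$workdir"        # runs whether or not the steps above succeeded
    return "$status"
}
```

Usage would be `strip_then_ocr old.pdf new.pdf`. This does not cover interruption by a signal; a trap would be needed for that.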

score 0 · Answer 8 · answered Jan 17 '19 at 17:13

This is my quick-and-dirty solution based on ImageMagick's convert, tesseract, parallel and pdftk (all available on Debian-based distributions). It's largely based on this blog post.

#!/bin/sh -ex

density=${2:-"300"} # default to 300 DPI if 2nd parameter is not given

convert -monitor -density "$density" "$1" -monochrome -compress lzw -alpha deactivate page_%05d.tif
parallel --bar "tesseract {} {.} pdf 2>/dev/null" ::: page_*.tif
pdftk page_*.pdf cat output "${1%.*}-ocred.pdf" compress

# Cleanup temp files
rm page_?????.tif page_?????.pdf
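Since the script above depends on four separate tools, it can help to check that they are all installed before any work starts. A small sketch follows; the function names missing_tools and require_tools are made up for illustration.

```shell
#!/bin/sh
# Sketch: report which of the required commands are not on PATH.

# missing_tools CMD...
# Prints each command that cannot be found, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# require_tools CMD...
# Fails with a message on stderr if any command is missing.
require_tools() {
    missing=$(missing_tools "$@")
    if [ -n "$missing" ]; then
        echo "Please install:" $missing >&2   # unquoted on purpose: joins the lines
        return 1
    fi
}
```

Putting `require_tools convert tesseract parallel pdftk || exit 1` near the top of the script would stop it early with a readable message.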

score 0 · Answer 9 · answered Feb 26 '19 at 16:38

For a whole directory of PPM files, you can use this script, ppm2ocrpdf.sh:

#!/bin/sh

mkdir .pdf
for f in *.ppm; do
    # Note: this JPEG-compressed PDF is not used by the later steps; it is deleted again below.
    echo " Running convert -compress JPEG -quality 88 $f -page a4 ${f}ppm.pdf"
    convert -compress JPEG -quality 88 "$f" -page a4 "${f}ppm.pdf"
    # tesseract appends .pdf to the output base name, so this writes "$f.pdf".
    echo " Running tesseract -l deu $f $f pdf"
    tesseract -l deu "$f" "$f" pdf
    echo " Running pdftk $f.pdf cat output ./.pdf/${f}ocr.pdf"
    pdftk "$f.pdf" cat output "./.pdf/${f}ocr.pdf"
    echo " Running rm ${f}ppm.pdf"
    rm "${f}ppm.pdf"
    echo " Running rm $f.pdf"
    rm "$f.pdf"
done
# Merge the per-page PDFs into one document, then remove the working directory.
echo " Running pdftk ./.pdf/*.pdf cat output outOcrDocument.pdf"
pdftk ./.pdf/*.pdf cat output outOcrDocument.pdf
echo " Running rm ./.pdf/*.pdf"
rm ./.pdf/*.pdf
echo " Running rmdir .pdf"
rmdir .pdf
echo "Done"

score 0 · Answer 10 · answered Jan 20 '24 at 14:27

You can use OCRthyPDF: https://snapcraft.io/ocrthypdf

It is a frontend for ocrmypdf and is available as a snap.

answered Jan 20 '24 at 14:27 by ping pong
