35

I have a good-quality scan of a document in PDF format.

How can I add OCR information to the PDF so that it becomes searchable? By searchable I mean that when viewing the PDF with Evince, Ctrl+F actually lets me search the PDF's content.

fdierre
  • 1,033

10 Answers

25

pdfsandwich

pdfsandwich does what you want and provides Ubuntu deb packages. It uses Tesseract as its OCR engine. The following call adds the text layer to your scanned PDF:

pdfsandwich scanned.pdf

The following does the same, but with another language (ISO 639-2 code; install the matching tesseract-ocr-LANGCODE package) and an explicit layout:

pdfsandwich -verbose -lang spa -layout single scanned.pdf

If you get an error, please download the latest deb package from SourceForge.
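Since the -lang option only works when the matching Tesseract language data is installed, it can help to check first. The following is a minimal sketch, not part of pdfsandwich; the function name have_tess_lang is made up for illustration, and it assumes the tesseract binary is on your PATH:

```shell
# Succeed if Tesseract lists the given language code as installed.
# have_tess_lang is a hypothetical helper name, not a pdfsandwich feature.
have_tess_lang() {
    # Older Tesseract releases print the list on stderr, so merge both streams.
    tesseract --list-langs 2>&1 | grep -qx -- "$1"
}
```

For example, have_tess_lang spa || sudo apt-get install tesseract-ocr-spa would install the Spanish data only when it is missing.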

Disclaimer: I'm the developer of pdfsandwich and therefore obviously biased.

Pablo Bianchi
  • 17,371
9

There are two projects that do the trick: GScan2PDF and OCRFeeder.

Aldi
  • 91
8

OCRmyPDF is easy to set up and produces an output PDF with the same quality as the input file at a reasonable size:

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.

ocrmypdf                      # it's a scriptable command line program
   -l eng+fra                 # it supports multiple languages
   --rotate-pages             # it can fix pages that are misrotated
   --deskew                   # it can deskew crooked PDFs!
   --title "My PDF"           # it can change output metadata
   --jobs 4                   # it uses multiple cores by default
   --output-type pdfa         # it produces PDF/A by default
   input_scanned.pdf          # takes PDF input (or images)
   output_searchable.pdf      # produces validated PDF output
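
The annotated listing above is meant for reading; a trailing comment on each line prevents the shell from treating it as one command. A runnable form of the same options might look like the sketch below, where the wrapper name ocr_scan is an assumption made for illustration:

```shell
# Runnable form of the annotated option list above.
# "$1" is the scanned input PDF, "$2" the searchable output PDF.
# ocr_scan is a hypothetical wrapper name, not part of OCRmyPDF.
ocr_scan() {
    ocrmypdf \
        -l eng+fra \
        --rotate-pages \
        --deskew \
        --title "My PDF" \
        --jobs 4 \
        --output-type pdfa \
        "$1" "$2"
}
```

Call it as ocr_scan input_scanned.pdf output_searchable.pdf.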
Artur Meinild
  • 31,035
4

I found a non-ideal solution, but a very effective one.

I use PDF X-Change Viewer through Wine. It has an OCR feature which adds a text layer to the existing image-based pdf.

Thus you can search and copy text from this invisible layer.
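
Whichever tool adds the layer, you can check from the command line that it is really there. This is a rough sketch, assuming pdftotext from the poppler-utils package is installed; the function name has_text_layer is made up for illustration:

```shell
# Succeed if extracting text from the PDF yields at least one letter or digit.
# Assumes pdftotext (poppler-utils); has_text_layer is a hypothetical name.
has_text_layer() {
    # "-" sends the extracted text to stdout instead of a .txt file.
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}
```

For example, has_text_layer scanned.pdf && echo searchable prints "searchable" only if some text could be extracted.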


To Do
  • 15,833
2

I use ocrmypdf and it just works fine.

ocrmypdf input.pdf output.pdf --force-ocr

On a Raspberry Pi I created a shell script that converts all the PDF files in the current folder. It has the following content:

for i in *.pdf; do ocrmypdf "$i" "$i" --force-ocr; done

I call it by executing bash convertToSearchablePDF.sh in the terminal.
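
If you would rather keep the originals, a variant can write the results into a separate folder. This is only a sketch under that assumption; the function name ocr_folder and the default folder name ocr are made up for illustration:

```shell
# OCR every PDF in the current directory into "$1" (default: ./ocr),
# leaving the original files untouched.
# ocr_folder is a hypothetical helper name.
ocr_folder() {
    outdir=${1:-ocr}
    mkdir -p "$outdir" || return 1
    for f in *.pdf; do
        [ -e "$f" ] || continue   # no PDFs present: the glob stays literal
        # Report a failed file and carry on with the next one.
        ocrmypdf --force-ocr "$f" "$outdir/$f" || echo "failed: $f" >&2
    done
}
```

Run ocr_folder searchable inside the folder that holds the scans.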

muru
  • 207,228
Andru
  • 21
2

For a command line solution, you can use pdfocr.

In brief, install software:

$ sudo apt-get install python-software-properties
$ sudo add-apt-repository ppa:gezakovacs/pdfocr
$ sudo apt-get update
$ sudo apt-get install pdfocr

Then run pdfocr:

$ pdfocr -i scanned.pdf -o scanned.with.search.pdf

That worked for me on Ubuntu 12.04 LTS.

1

I needed to remove a bad OCR and reduce the size of my PDF as well; I came up with the following script using ocrmypdf and ghostscript.

#!/usr/bin/sh
TEMP_FILE="$(mktemp --suffix=.pdf)" &&
    ghostscript -q -dNOPAUSE -dBATCH -dSAFER -dPDFA=2 -dPDFACompatibilityPolicy=1 -dSimulateOverprint=true -sDEVICE=pdfwrite -dCompatibilityLevel=1.3 -dPDFSETTINGS=/ebook -dAutoRotatePages=/None -dColorImageDownsampleType=/Bicubic -dColorImageResolution=150 -dGrayImageDownsampleType=/Bicubic -dFILTERTEXT -dImageResolution=300 -sOutputFile="$TEMP_FILE" "$1" &&
    ocrmypdf "$TEMP_FILE" "$2" &&
    rm "$TEMP_FILE"

The long ghostscript line removes the text layer and makes various space-saving changes to the first argument $1, saving them to a temporary file. We then add OCR with ocrmypdf (which is an excellent tool), and output to the path given by the second argument $2.

User12345
  • 111
0

This is my quick-and-dirty solution based on ImageMagick's convert, tesseract, parallel and pdftk (all available on Debian-based distributions). It's largely based on this blog post.

#!/bin/sh -ex

density=${2:-"300"} # default to 300 DPI if 2nd parameter is not given

convert -monitor -density "$density" "$1" -monochrome -compress lzw -alpha deactivate page_%05d.tif
parallel --bar "tesseract {} {.} pdf 2>/dev/null" ::: page_*.tif
pdftk page_*.pdf cat output "${1%.*}-ocred.pdf" compress

# Cleanup temp files
rm page_?????.tif page_?????.pdf
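
Because the script runs with -e, a missing tool aborts it partway through. A small pre-flight check could look like the sketch below; the function name missing_tools is made up for illustration:

```shell
# Print the name of each listed command that is not found on PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}
```

For the script above you would call missing_tools convert tesseract parallel pdftk; no output means all four are available.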
stefanct
  • 126
0

For a whole directory of PPM files, you can use this script, ppm2ocrpdf.sh:

#!/bin/sh

mkdir .pdf
for f in *.ppm; do
    echo " Running convert -compress JPEG -quality 88 $f -page a4 ${f}ppm.pdf"
    convert -compress JPEG -quality 88 "$f" -page a4 "${f}ppm.pdf"
    echo " Running tesseract -l deu $f $f pdf"
    tesseract -l deu "$f" "$f" pdf
    echo " Running pdftk $f.pdf cat output ./.pdf/${f}ocr.pdf"
    pdftk "$f.pdf" cat output "./.pdf/${f}ocr.pdf"
    echo " Running rm ${f}ppm.pdf"
    rm "${f}ppm.pdf"
    echo " Running rm $f.pdf"
    rm "$f.pdf"
done
echo " Running pdftk ./.pdf/*.pdf cat output outOcrDocument.pdf"
pdftk ./.pdf/*.pdf cat output outOcrDocument.pdf
echo " Running rm ./.pdf/*.pdf"
rm ./.pdf/*.pdf
echo " Running rmdir .pdf"
rmdir .pdf
echo "Done"
0

You can use OCRthyPDF -> https://snapcraft.io/ocrthypdf

It is a frontend for ocrmypdf and is available as a snap.