What's the best, simplest OCR solution?

Question

I'd like to scan a good amount of papers I have lying around, with the least possible hassle. I would like to convert them to images using Simple Scan, then convert them to text using OCR. Is there a good OCR app with a GUI that will give me good results at the push of a button?

score 88 · Accepted Answer · edited Dec 09 '16 at 07:53

GOCR is an OCR (Optical Character Recognition) program. It converts scanned images of text back to text files.
CLARA is another good graphical option.
OCRAD is an OCR program that can be used as a stand-alone console application or as a backend to other programs.
KOOKA is a KDE application but works fine. In addition, you have to install actual OCR programs like GOCR and OCRAD. After installing Kooka and the OCR programs, you have to point Kooka to the OCR install location so that it can convert the JPEG to text.
OCRFeeder is a document layout analysis and optical character recognition system.
Tesseract is a command-line utility and is very simple to use. You can install the English language package, tesseract-ocr-eng.

Have a look at this page.

Note:
To run Tesseract, open a terminal and type the following (Tesseract appends .txt to the output name itself):

tesseract imagefile.tif outputfile

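Older versions of Tesseract can only read TIFF files; if you've got a JPEG or PDF or whatever, you'll have to convert it first. With those versions the filename extension must also be .tif, not .tiff, otherwise Tesseract errors out. Newer releases (3.x and later, built with Leptonica) also read formats such as PNG and JPEG directly.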
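A minimal sketch of that conversion step, assuming ImageMagick's `convert` is installed (it is not mentioned in this answer) and that your Tesseract build needs TIFF input. The helper names `tif_name` and `ocr_image` are made up for illustration.

```shell
#!/bin/sh
# Derive the .tif file name for an input image (scan.jpg -> scan.tif).
tif_name() {
    printf '%s.tif\n' "${1%.*}"
}

# Convert one image to TIFF with ImageMagick, then OCR it.
# Tesseract appends ".txt" to the output base name itself.
ocr_image() {
    src=$1
    tif=$(tif_name "$src")
    # Skip the conversion when the input is already a .tif file.
    [ "$src" = "$tif" ] || convert "$src" "$tif" || return 1
    tesseract "$tif" "${tif%.tif}"
}

# Usage: ocr_image page1.jpg   # writes page1.tif and page1.txt
```

PDF input still needs a separate rasterising step, which this sketch does not cover.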

kenorb · Answer 2 · 2018-05-03T11:57:43.473

There are a few popular OCR command-line tools you can use (I'm not sure whether they have a GUI):

Tesseract (ReadMe, FAQ) (Python)

Also available for: Tesseract .NET, Tesseract iOS

An OCR Engine that was developed at HP Labs between 1985 and 1995... and now at Google. Tesseract is probably the most accurate open source OCR engine available.

Usage:
```
tesseract [inputFile] [outputFile] [-l optionalLanguageFile] [PathTohOCRConfigFile]
```
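As a rough illustration of the usage line above, here is a sketch that runs Tesseract over every .tif file in a folder. The function name `ocr_dir` and the `TESSERACT` override variable are hypothetical, introduced only so the loop can be tried as a dry run; the language defaults to `eng`, which assumes that language data is installed.

```shell
#!/bin/sh
# OCR every .tif file in a directory, one text file per image.
# TESSERACT can be overridden (e.g. TESSERACT=echo) for a dry run.
ocr_dir() {
    dir=$1
    lang=${2:-eng}
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue      # directory had no .tif files
        # Second argument is the output base; tesseract adds ".txt".
        "${TESSERACT:-tesseract}" "$img" "${img%.tif}" -l "$lang"
    done
}

# Usage: ocr_dir ~/scans deu
```

Running it with `TESSERACT=echo` prints the commands instead of executing them, which is a cheap way to check the file names before a long batch.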
GOCR

Open-source character recognition. It converts scanned images of text back to text files. GOCR can be used with different front-ends, which makes it very easy to port to different OSes and architectures. It can open many different image formats, and its quality has been improving on a daily basis.

OCRopus™ (FAQ) (written in Python, NumPy, and SciPy)

OCR system focusing on the use of large scale machine learning for addressing problems in document analysis, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities.

The OCRopus engine is based on two research projects: a high-performance handwriting recognizer developed in the mid-90's and deployed by the US Census bureau, and novel high-performance layout analysis methods.

OCRopus development is sponsored by Google and is initially intended for high-throughput, high-volume document conversion efforts. We expect that it will also be an excellent OCR system for many other applications.

Tessnet2 (Open source, OCR, Tesseract, .NET, DOTNET, C#, VB.NET, C++/CLI)

Tesseract is a C++ open source OCR engine. Tessnet2 is a .NET assembly that exposes very simple methods to do OCR. Tessnet2 is under the Apache 2 license (like Tesseract), meaning you can use it however you want, including in commercial products.

A few others: ABBYY CLI OCR for Linux, Asprise OCR

For a more complete list, check: List of optical character recognition software at Wikipedia

See also: wanghaisheng/awesome-ocr - A curated list of promising OCR resources at GitHub.

score 15 · Answer 3 · edited Mar 08 '17 at 22:28

Gscan2PDF

OCR on multi page PDF or scanned documents

This is probably the easiest way. Gscan2pdf is a graphical tool which lets you not only scan files, but also import files and perform OCR on them. Install gscan2pdf from the Ubuntu Software Center or by running this command in a terminal:

sudo apt-get install gscan2pdf

Run gscan2pdf
Import the pdf (Ctrl+O)
Optional: Tools > Clean up
Choose Tools > OCR
Save (Ctrl+S)

Gscan2PDF can use several OCR engines; the default is tesseract-ocr.

You might consider selecting the appropriate language. In that case you will need to install the tesseract-ocr-LANG package, where LANG is the three-letter ISO 639-2 language code. Right now there are 108 languages in the 16.04 repository.
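A small sketch of the package naming rule described above. The helpers `lang_pkg` and `install_langs` are hypothetical names, and the underscore-to-hyphen step is an assumption based on Debian package names not allowing underscores (for example, the `chi_sim` data would live in `tesseract-ocr-chi-sim`).

```shell
#!/bin/sh
# Map a Tesseract language code to its Ubuntu package name.
# Underscores become hyphens, since package names cannot contain "_".
lang_pkg() {
    printf 'tesseract-ocr-%s\n' "$1" | tr '_' '-'
}

# Install the language data for each code given, e.g. deu fra.
install_langs() {
    for code in "$@"; do
        sudo apt-get install -y "$(lang_pkg "$code")" || return 1
    done
}

# Usage:
#   install_langs deu fra
#   tesseract --list-langs     # show which languages Tesseract can use
```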

Source

score 13 · Answer 4 · answered Nov 03 '18 at 21:24

Just because it works very nicely and should definitely be in the list:

gimageReader
Example from a screenshot:

It is in the repositories (answered on 18.10, but I have been using it for ages).

Jacob Vlijm


Eduard Florinescu · Answer 5 · 2018-11-03T21:00:07.213

The best and easiest way out there is to use pypdfocr; it doesn't change the PDF. pypdfocr is a Python module.

pypdfocr your_document.pdf

At the end you will have another file, your_document_ocr.pdf, the way you want it, with searchable text. The app doesn't change the quality of the image; it only increases the file size a bit by adding the overlay text.

I think the command is easy enough that it doesn't need a GUI. Installing pypdfocr is maybe a bit more involved:

sudo apt install tesseract-ocr 
pip install pypdfocr

Update, 3 November 2018:

pypdfocr has not been supported since 2016, and I noticed some problems because it is no longer maintained. The ocrmypdf module does a similar job and can be used like this:

ocrmypdf in.pdf out.pdf

To install:

pip install ocrmypdf

or

sudo apt install ocrmypdf
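If you have a whole folder of scans, the same command can be looped. This is only a sketch: `ocr_name`, `ocr_pdfs`, and the `OCRMYPDF` override variable are made-up names, and the `_ocr.pdf` suffix simply mirrors what pypdfocr produced. The `-l` option selects the Tesseract language and defaults to `eng` here.

```shell
#!/bin/sh
# Output name for a PDF: report.pdf -> report_ocr.pdf
ocr_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# Add a text layer to every PDF in a directory; originals stay untouched.
# OCRMYPDF can be overridden (e.g. OCRMYPDF=echo) for a dry run.
ocr_pdfs() {
    dir=$1
    lang=${2:-eng}
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue
        case $pdf in *_ocr.pdf) continue ;; esac   # skip earlier results
        "${OCRMYPDF:-ocrmypdf}" -l "$lang" "$pdf" "$(ocr_name "$pdf")"
    done
}

# Usage: ocr_pdfs ~/scans deu
```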

score 9 · Answer 6 · edited Mar 08 '17 at 14:07

linux-intelligent-ocr-solution

disclaimer - I am closely connected with the development of this opensource solution

Lios can convert print to text using either a scanner or a camera.

It can also produce text from scanned images from other sources, such as a PDF, an image, or a folder containing images.

The program is fully accessible to visually impaired users.

Since I'm closely connected - I would love feedback.

score 3 · Answer 7 · edited Mar 08 '17 at 19:41

I have just had success (under 16.04) with pdfocr.rb, which is listed on the Ubuntu wiki.

There is a PPA, but its repository has not been updated for 16.04. The Ruby script from GitHub still works with 16.04, though.

You can download it from GitHub. You will need the following packages installed:

ruby tesseract-ocr pdftk exactimage

Then I made pdfocr.rb executable and ran:

./pdfocr.rb -i source.pdf -o output.pdf

Optionally, you can use the -l LANG parameter. In that case you will need to install the tesseract-ocr-LANG package, where LANG is the three-letter ISO 639-2 language code. Right now there are 108 languages in the 16.04 repository.
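Before running the script it can help to check that the tools from the package list above are actually on your PATH. A sketch, where `missing_cmds` is a made-up helper and the command names are assumptions: in particular, I am assuming the exactimage package is needed for its `hocr2pdf` binary.

```shell
#!/bin/sh
# Print each command from the argument list that is not on PATH.
missing_cmds() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Commands pdfocr.rb is assumed to call (hocr2pdf from exactimage):
#   missing=$(missing_cmds ruby tesseract pdftk hocr2pdf)
#   [ -z "$missing" ] || echo "Install first: $missing"
```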

score 1 · Answer 8 · answered Nov 20 '14 at 15:45

gscan2pdf includes three different OCR engines. You can scan straight into the program or import your PDF into it. I've found that the Tesseract engine works great and is very easy to use.

Vince West


score 0 · Answer 9 · answered Jul 16 '21 at 14:01

OCRFeeder has already been mentioned as one of many options, but I thought it would be worth mentioning why it fulfills your requirements:

It has a GUI (unlike some of the applications mentioned in some of the other answers)
It's easy to use (click Add Image then click Recognize Document)

In addition, it has other qualities that make it an excellent choice:

It's just a frontend and can use one of any number of backends (engines), with built-in support for CuneiForm, GOCR, Ocrad and Tesseract (https://gitlab.gnome.org/GNOME/ocrfeeder/-/blob/master/src/ocrfeeder/util/configuration.py).
It's packaged for Ubuntu (as ocrfeeder)
It's still under active development at the time of this posting
It's part of the Gnome project
