52

How can I extract text from images?

I am not talking about scanned files, but garden-variety images, such as a high-resolution photo of a neatly handwritten blackboard taken in class, or a photographed page from a recipe book when you want the recipe as text.

Is there any free and open-source software for that?

I tried tesseract, and the results were awful.

Zanna
  • 72,312
Strapakowsky
  • 12,304

5 Answers

43

tesseract-ocr is the best of the available tools. To install it, run the following command:

sudo apt-get install tesseract-ocr

Usage is tesseract filename.jpg output, which writes the recognized text to output.txt (tesseract appends the .txt extension itself).

You may also need to select the appropriate language. In that case, install the tesseract-ocr-LANG package, where LANG is the three-letter ISO 639-2 language code. The 18.04 repository currently provides 123 languages. Then run, for example:

tesseract mySpanishText.jpg output -l spa
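
If you want to call the same command from a script, a minimal Python sketch might look like the following. The function names build_tesseract_command and ocr_image are made up for illustration; the only tesseract features it relies on are the ones shown above (the output base name and the -l option), and it assumes tesseract is installed and on your PATH.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang=None):
    """Build the argument list for the tesseract CLI (hypothetical helper).

    tesseract appends ".txt" to output_base on its own.
    """
    command = ["tesseract", str(image), str(output_base)]
    if lang:
        # -l selects the language data, e.g. "spa" for Spanish.
        command += ["-l", lang]
    return command


def ocr_image(image, lang=None):
    """Run tesseract on one image and return the recognized text."""
    image = Path(image)
    output_base = image.with_suffix("")  # photo.jpg -> photo
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    # Append ".txt" by hand so names such as my.photo.jpg are handled correctly.
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

For example, ocr_image("mySpanishText.jpg", lang="spa") would run the Spanish command above and return the contents of mySpanishText.txt.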
nomadSK25
  • 113
39

Extracting text from images is called OCR (optical character recognition), and Ubuntu has a wiki page dedicated to it. From that page:

Available OCR tools

The Ubuntu Universe repositories contain the following OCR tools:

  1. gocr - A command line OCR
  2. fuzzyocr - spamassassin plugin to check image attachments
  3. libhocr0 - Hebrew OCR
  4. ocrad - Optical Character Recognition program
  5. ocrfeeder - Document layout analysis and optical character recognition system
  6. ocropus - document analysis and OCR system
  7. tesseract-ocr

The Ubuntu multiverse repositories also contain:

  1. cuneiform - multi-language OCR system
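
To see which of these tools are already installed on your machine, a rough Python sketch follows. The executable names in OCR_COMMANDS are my assumption about what each package installs (for example, that tesseract-ocr provides a tesseract binary), and installed_ocr_tools is a name invented for this example.

```python
import shutil

# Package name -> executable name. These executable names are assumptions;
# adjust them if a package installs its binary under a different name.
OCR_COMMANDS = {
    "gocr": "gocr",
    "ocrad": "ocrad",
    "tesseract-ocr": "tesseract",
    "cuneiform": "cuneiform",
}


def installed_ocr_tools(commands=OCR_COMMANDS):
    """Return {package: path} for every executable found on PATH."""
    found = {}
    for package, executable in commands.items():
        path = shutil.which(executable)  # None when the command is missing
        if path:
            found[package] = path
    return found
```

Calling installed_ocr_tools() returns an empty dict when none of the listed commands are on your PATH.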

Some packages are outdated, but unofficial fresh ones can be found in the Alex_P PPA (PPA adding code: ppa:alex-p/notesalexp). If you have never used a PPA, check how to add software from a PPA.

Edit: As pointed out in a comment, Clara OCR exists too, but its package got stuck at Hardy and its website was last updated in 2009.

Rinzwind
  • 309,379
7

Frog

Try Frog. Frog is an intuitive text extraction tool (OCR) for GNOME.


Get it from the Snap Store or download it on Flathub.

Flimm
  • 44,031
2

TextSnatcher

Try TextSnatcher. This application uses Tesseract OCR 4.x for character recognition behind the scenes.


Probably the easiest way to install it on Ubuntu is to get it from Flathub:

  1. First, if you haven't already, install Flatpak using the Ubuntu quick start guide. Remember to restart your system afterwards.

  2. Go to TextSnatcher on Flathub and click Install. Or, if you prefer the command line, run this command:

    flatpak install flathub com.github.rajsolai.textsnatcher
    
Flimm
  • 44,031
1

Using tesseract-ocr, we can extract text from images. I also tested gocr, which didn't work as well as tesseract-ocr.

Installation:

sudo apt-get install tesseract-ocr

Python program that converts every image file with a .png extension in the current directory to a text file:

#!/usr/bin/env python3
import os
import subprocess

def list_files(path):
    """Return the paths of all regular files directly inside path."""
    files = []
    for name in os.listdir(path):
        full_path = os.path.join(path, name)
        if os.path.isfile(full_path):
            files.append(full_path)
    return files

def convertImageToText(img_file):
    # tesseract appends ".txt" itself, so pass the name without its extension.
    output_base = os.path.splitext(img_file)[0]
    # An argument list (no shell) keeps file names with spaces intact.
    subprocess.run(['tesseract', img_file, output_base], check=True)

def startOperation():
    list_file = list_files(".")
    print(list_file)
    for img_file in list_file:
        # Case-insensitive match, so .PNG files are converted too.
        if img_file.lower().endswith(".png"):
            convertImageToText(img_file)

startOperation()