How can I extract text from images?

Question

I am not talking about scanned files, but garden variety images, such as when you take a high-def picture of a blackboard at class, and it is nicely handwritten; or when you photograph a page from a recipe book and want the recipe in text format.

Any free and open software for that?

I tried tesseract, and the results were awful.

score 43 · Answer 1 · edited Apr 16 '20 at 16:35

tesseract-ocr works better than all the others. To install it, run the command below:

sudo apt-get install tesseract-ocr

Usage is tesseract filename.jpg output, which writes the recognized text to output.txt (tesseract appends the .txt extension to the output name itself).

You may also need to select the appropriate language. In that case, install the tesseract-ocr-LANG package, where LANG is the three-letter ISO 639-2 language code. At the time of writing, the 18.04 repository offers 123 languages. Then use, for example:

tesseract mySpanishText.jpg output -l spa
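
If you want to script this, the same invocation can be assembled from Python. This is only a minimal sketch: it assumes tesseract is on the PATH and that the matching tesseract-ocr-LANG package is installed, and the helper names (language_package, build_tesseract_command, run_ocr) are made up for illustration.

```python
import shutil
import subprocess


def language_package(lang):
    """Ubuntu package name for one three-letter language code."""
    return f"tesseract-ocr-{lang}"


def build_tesseract_command(image, output_base, lang=None):
    """Build the argument list; tesseract appends ".txt" to output_base."""
    command = ["tesseract", image, output_base]
    if lang is not None:
        command += ["-l", lang]
    return command


def run_ocr(image, output_base, lang=None):
    """Run tesseract, failing early if the binary is not installed."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found; install tesseract-ocr first")
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)


# Show the command that run_ocr("mySpanishText.jpg", "output", "spa") would run.
print(" ".join(build_tesseract_command("mySpanishText.jpg", "output", lang="spa")))
# prints: tesseract mySpanishText.jpg output -l spa
```

Passing the arguments as a list (rather than one shell string) means file names with spaces need no quoting.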

Rinzwind · Accepted Answer · 2011-08-31T11:23:29.257

Extracting text from images is called OCR (optical character recognition), and Ubuntu has a wiki page dedicated to it. From that page:

Available OCR tools

The Ubuntu Universe repositories contain the following OCR tools:

gocr - A command line OCR
fuzzyocr - spamassassin plugin to check image attachments
libhocr0 - Hebrew OCR
ocrad - Optical Character Recognition program
ocrfeeder - Document layout analysis and optical character recognition system
ocropus - document analysis and OCR system
tesseract-ocr

The Ubuntu multiverse repositories also contain:

cuneiform - multi-language OCR system
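
To see which of these tools are already installed, one option is to probe the PATH. A small Python sketch follows; the executable names in CANDIDATE_COMMANDS are an assumption about what each package installs (package and command names can differ, as with tesseract-ocr and tesseract), so adjust them to your system.

```python
import shutil

# Assumed command names, not taken from the wiki page; edit as needed.
CANDIDATE_COMMANDS = ["gocr", "ocrad", "tesseract", "cuneiform"]


def installed_tools(commands):
    """Map each command name to True if such an executable is on the PATH."""
    return {name: shutil.which(name) is not None for name in commands}


for name, present in installed_tools(CANDIDATE_COMMANDS).items():
    print(f"{name}: {'installed' if present else 'not found'}")
```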

Some packages are outdated, but fresh unofficial ones can be found in the Alex_P PPA (to add it, use ppa:alex-p/notesalexp). If you have never used a PPA, see how to add software from a PPA.

Edit: as pointed out in a comment, Clara OCR exists too, but it got stuck at Hardy and its website was last updated in 2009.

Flimm · Answer 3 · 2023-07-28T10:15:21.527

score 7

Frog

Try Frog. Frog is an intuitive text extraction tool (OCR) for GNOME.

answered Feb 16 '23 at 19:46

Flimm · Answer 4 · 2023-02-16T19:48:48.263

TextSnatcher

Try TextSnatcher. This application uses Tesseract OCR 4.x behind the scenes for character recognition.

Probably the easiest way to install it on Ubuntu is to get it from Flathub:

First, if you haven't already, install Flatpak using the Ubuntu quick start guide. Remember to restart your system afterwards.
Go to TextSnatcher on Flathub and click Install. Or, if you prefer the command-line, run this command:
```
flatpak install flathub com.github.rajsolai.textsnatcher
```

score 1 · Answer 5 · answered Mar 29 '22 at 03:26

tesseract-ocr can extract text from images. I also tested gocr, which did not work as well as tesseract-ocr.

Installation:

sudo apt-get install tesseract-ocr

Python program that converts every image file with a .png extension in the current directory to a .txt file:

#!/usr/bin/env python3
"""Run tesseract on every PNG file in the current directory."""
import os
import subprocess


def list_files(path):
    """Return the paths of all regular files directly inside path."""
    return [
        os.path.join(path, name)
        for name in os.listdir(path)
        if os.path.isfile(os.path.join(path, name))
    ]


def convert_image_to_text(img_file):
    """OCR one image; tesseract appends ".txt" to the output base name."""
    output_base = os.path.splitext(img_file)[0]
    # An argument list (no shell) keeps file names with spaces intact.
    subprocess.run(["tesseract", img_file, output_base], check=True)


def start_operation():
    files = list_files(".")
    print(files)
    for img_file in files:
        # Compare the extension case-insensitively so ".PNG" also matches.
        if os.path.splitext(img_file)[1].lower() == ".png":
            convert_image_to_text(img_file)


if __name__ == "__main__":
    start_operation()
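
A possible variation on the script above, in case you want the recognized text back in Python instead of in .txt files. It relies on tesseract accepting stdout as the output name, which makes it print the text rather than write a file. Treat it as a sketch: the function names and the list of extensions are my own choices, and it assumes tesseract is installed and on the PATH.

```python
import subprocess
from pathlib import Path


def find_images(directory, extensions=(".png", ".jpg", ".jpeg")):
    """Return image files in directory, matching extensions case-insensitively."""
    return sorted(
        path for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in extensions
    )


def ocr_to_string(image_path):
    """Return the recognized text; "stdout" tells tesseract to print it."""
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_directory(directory):
    """Map each image file name to its recognized text."""
    return {path.name: ocr_to_string(path) for path in find_images(directory)}
```

Calling ocr_directory(".") then gives a dictionary you can search or post-process without reading the output files back in.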
