Questions tagged [tesseract]

An open-source optical character recognition engine

Tesseract is an open-source optical character recognition engine. Character data sets for various scripts and languages pre-exist and the engine allows training of additional (custom) data sets.

Tesseract's output will be of very poor quality if the input images are not preprocessed to suit it. Images (especially screenshots) must be scaled up so that the text x-height is at least 20 pixels. Any rotation or skew must be corrected, or no text will be recognized. Low-frequency changes in brightness must be high-pass filtered, or Tesseract's binarization stage will destroy much of the page. Dark borders must be manually removed, or they will be misinterpreted as characters.
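Two of the preprocessing steps above can be sketched in plain Python. This is only an illustration under stated assumptions: `upscale_factor` and `flatten_background` are hypothetical helper names, the image is a list of rows of 0..255 grayscale values rather than a real image file, and the box-filter subtraction is a simple stand-in for whatever high-pass filter an image tool would actually apply.

```python
import math


def upscale_factor(xheight_px, target_px=20):
    """Smallest integer scale factor that brings the x-height to target_px or more."""
    if xheight_px <= 0:
        raise ValueError("x-height must be positive")
    return max(1, math.ceil(target_px / xheight_px))


def flatten_background(gray, radius=1):
    """Crude high-pass filter: subtract the local mean brightness from each pixel.

    gray is a list of rows of ints in 0..255. The result is re-centred on 128
    and clamped to 0..255, so slow brightness changes are evened out while
    sharp local detail (such as text strokes) is kept.
    """
    height, width = len(gray), len(gray[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Box window around (x, y), clipped at the image edges.
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            window = [gray[j][i] for j in ys for i in xs]
            local_mean = sum(window) / len(window)
            value = round(gray[y][x] - local_mean + 128)
            row.append(min(255, max(0, value)))
        out.append(row)
    return out


if __name__ == "__main__":
    # Text with an 8-pixel x-height needs at least a 3x upscale to reach 20 pixels.
    print(upscale_factor(8))
    # A uniformly lit patch comes out as flat mid-gray, whatever its brightness.
    print(flatten_background([[100] * 3 for _ in range(3)]))
```

In practice the filter radius would be chosen much larger than the text height, so that only the slowly varying illumination is removed; deskewing and border removal are not covered by this sketch.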

20 questions
5
votes
2 answers

ocrfeeder doesn't detect anything

When I try to detect text on my JPEG, it correctly shows all areas where it suspects text and images, but when I export it to ODT it only creates an ODT with empty text and image frames. Do I have to configure tesseract somehow? (I use Ubuntu 14.10…
rubo77
5
votes
2 answers

What program is suitable for making scanned PDF files searchable?

I would like to be able to scan paper documents to PDF files and make the text searchable. I believe the Tesseract program can assist this, but don't know how to begin, and don't know what would be the best program to use. Is anybody making…
3
votes
1 answer

How to improve tesseract performance?

By all accounts, tesseract is superb. However, my results are dismal. I need to convert (digital, as opposed to from a book) text that I only have as a png. For instance: 2 3 academics 1 1711 2 3 Achlmbobelmann 211 191—2 1 3 Aoqusmono|Food…
katriel
3
votes
0 answers

Tesseract giving errors

This morning I tried to use tesseract and I'm getting the following error messages: $ tesseract --list-langs Error in pixReadMemTiff: function not present Error in pixReadMem: tiff: no pix returned Error in pixaGenerateFontFromString: pix not…
To Do
2
votes
0 answers

OCR with two-page layout

I'm trying to do OCR on a PDF with a two-page layout: in a landscape-orientation page of the PDF, the left half is one (portrait-orientation) page, the right half is the next (portrait-orientation) page. Sometimes the layout messes up tesseract.…
Raffi
1
vote
1 answer

Tesseract OCR - problems finding languages

I was having problems with Ubuntu 22.04 on my Framework laptop, and did a complete re-installation, using Ubuntu 24.04.1. I have just reinstalled Tesseract using snap. It was working OK previously on Ubuntu 22.04, but now gives the…
1
vote
0 answers

I'm having trouble installing OCRopy, which I want to use to create training data for an old manuscript in Latin. What prerequisites are needed and what lines should I write?

I am new to using Ubuntu and I am trying to install OCRopy to make training data, with the end goal of creating a transcript of a 15th-century manuscript. So far I suspect that my problem may be a lack of prerequisites. I have installed python3…
mumbot
1
vote
1 answer

Cannot make .box files - Training Tesseract

I am trying to train Tesseract in Ubuntu 20.04.1 LTS. I have downloaded tesseract and the training tools required. For the training data I am using jTessBoxEditor. I have the .tiff files but I am unable to make the .box files. When I type the following…
Hula
1
vote
2 answers

How to write bash script to run the same command for all files in a directory

I want to run this command for all files in a directory. tesseract /home/kong/Documents/input/248.jpg stdout --psm 1 --oem 1 --dpi 300 tsv >/home/kong/Documents/input/ocr_output/input/248.tsv The input and output should have same number like…
1
vote
1 answer

Can Qt-box-editor be used for tesseract 4.0?

I am using tesseract 4.0 for character recognition. Many blogs say that Qt-box-editor can be used with tesseract 3.x. My question is: can Qt-box-editor be used with tesseract 4.0?
1
vote
3 answers

Tesseract --tessdata-dir option not working in Ubuntu 18.04

I am trying to use the best model from tesseract. However, I am getting the following error: tesseract sample.jpg stdout --tessdata-dir tessdata/ Error opening data file tessdata/eng.traineddata Please make sure the TESSDATA_PREFIX environment…
Monster
1
vote
3 answers

Ubuntu 18.04: error installing tesseract

I've installed Ubuntu 18.04 and installed tesseract using sudo apt-get install tesseract-ocr. When I type tesseract -v, I get an error: tesseract: symbol lookup error: /usr/lib/x86_64-linux-gnu/libtesseract.so.4: undefined symbol:…
0
votes
2 answers

How can I get Tesseract OCR to recognise the large digits of an electricity meter?

I want to use an OCR program on an RPi to recognise the digits from a photo of my electricity meter. The digits are large and are very obvious to me, but Tesseract appears unable to recognise them at all - at best it detects a few random wrong…
0
votes
1 answer

KDE Wayland: Taking region screenshots faster?

I'm using this script from HN* to select regions on the screen and copy their text; I took out the line with mogrify. It uses Spectacle, but it takes a moment before opening the UI. Is it possible, and would it be faster, if Spectacle stayed open in…
0
votes
1 answer

How can I get tesseract-ocr v5 to find the eng.traineddata file?

Ubuntu 22.04.3 LTS, tesseract 5.3.2, XSane 0.999, YAGF 0.9.5, Epson Workforce WF-4835 printer/scanner. This setup works together to a point. Clicking the Scan button in YAGF causes XSane to start up, scan the document in the scanner, and display the…