I want to run this command for all files in a directory.

tesseract /home/kong/Documents/input/248.jpg stdout --psm 1 --oem 1 --dpi 300 tsv >/home/kong/Documents/input/ocr_output/input/248.tsv

The input and output should share the same number, e.g. 248.jpg and 248.tsv. I tried writing a Python script, but it is causing delimiter issues.

Can someone help me with this? I am a bash newbie.

This is the Python script I wrote:

comm = shlex.split(command)

out_dir = '/home/kong/Documents/input/ocr_output/input'


for file in tqdm(files):
    base_name = os.path.basename(file)
    number = base_name.split('.')[0]
    out_path = '>' + out_dir + '/' + number + '.tsv'
    comm[1] = file
    comm[-1] = out_path
#     tsv = number + '.tsv'
    with open(out_path, 'w') as f:
        subprocess.run(comm, shell=True, stdout=f)

2 Answers

Try this:

source_dir=/your/source/dir
output_dir=/your/output/dir

cd "$source_dir" || exit

for file in *.jpg; do
  tesseract "$file" stdout --psm 1 --oem 1 --dpi 300 tsv > "$output_dir/${file%.jpg}.tsv"
done

Just as an alternative, you can use this script with Python 3.5 or higher.

import os
import subprocess as sp

# input directory
in_dir = '/home/kong/Documents/input/'
# output directory
out_dir = '/home/kong/Documents/input/ocr_output/input/'

# list of .jpg files in input directory (other files are skipped)
files = [f for f in os.listdir(in_dir)
         if os.path.isfile(os.path.join(in_dir, f))
         and f.lower().endswith('.jpg')]

for file in files:
    # input file
    in_file = os.path.join(in_dir, file)

    basename = os.path.splitext(file)[0]
    # output file
    out_file = os.path.join(out_dir, basename + '.tsv')

    # run command and save its output to out with utf-8 encoding
    out = sp.run(['tesseract', in_file, 'stdout', '--psm', '1',
                  '--oem', '1', '--dpi', '300', 'tsv'],
                 stdout=sp.PIPE).stdout.decode('utf-8')

    # save command output to file, written back as utf-8
    with open(out_file, 'w', encoding='utf-8') as f:
        f.write(out)
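
As for why the script in the question misbehaves, two things stand out. On POSIX, when subprocess.run gets a list together with shell=True, only the first item is used as the command and the rest are handed to the shell itself, so tesseract is started without its arguments. The '>' glued onto out_path also becomes part of the file name passed to open(), because redirection is shell syntax and not part of a path. Below is a minimal sketch of the same loop without a shell, streaming tesseract's stdout straight into the file. The function names build_job and ocr_directory are made up for illustration, and it assumes tesseract is on your PATH and that the inputs end in .jpg.

```python
import subprocess
from pathlib import Path


def build_job(image, out_dir):
    """Return (command list, output path) for one image; nothing is run here."""
    image = Path(image)
    # 248.jpg -> <out_dir>/248.tsv; no '>' in the name, since no shell is involved
    out_path = Path(out_dir) / (image.stem + ".tsv")
    cmd = ["tesseract", str(image), "stdout",
           "--psm", "1", "--oem", "1", "--dpi", "300", "tsv"]
    return cmd, out_path


def ocr_directory(in_dir, out_dir):
    """Run tesseract on every .jpg in in_dir, writing one .tsv per image."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for image in sorted(Path(in_dir).glob("*.jpg")):
        cmd, out_path = build_job(image, out_dir)
        # Binary mode: tesseract's bytes go to the file unchanged
        with open(out_path, "wb") as f:
            # Argument list without shell=True, so each item reaches tesseract as-is
            subprocess.run(cmd, stdout=f, check=True)
```

With the paths from the question, you would call ocr_directory('/home/kong/Documents/input', '/home/kong/Documents/input/ocr_output/input'). check=True makes the loop stop with an exception if tesseract fails on a file; drop it if you would rather continue past bad images.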