39

Does anyone know of an application that can convert audio to text?

Kopano
  • 391

5 Answers

34

The software you can use is Vosk-api, a modern speech recognition toolkit based on neural networks. It supports 7+ languages and works on a variety of platforms, including RPi and mobile.

First you convert the file to the required format and then you recognize it:

ffmpeg -i file.mp3 -ar 16000 -ac 1 file.wav
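If you want to double check that the conversion produced what the recognizer expects, here is a minimal sketch using only the Python standard library. It assumes ffmpeg wrote an uncompressed PCM WAV file (the `wave` module cannot read anything else); the function names are made up for this example.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, sample_width_bytes) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        return wf.getnchannels(), wf.getframerate(), wf.getsampwidth()


def is_16k_mono(path):
    """True if the file matches the ffmpeg options above: 1 channel at 16000 Hz."""
    channels, rate, _width = describe_wav(path)
    return channels == 1 and rate == 16000
```

For example, `is_16k_mono("file.wav")` should be true right after the ffmpeg command above.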

Then install vosk-api with pip:

pip3 install vosk

Then use these steps:

git clone https://github.com/alphacep/vosk-api
cd vosk-api/python/example
wget https://alphacephei.com/kaldi/models/vosk-model-small-en-us-0.3.zip
unzip vosk-model-small-en-us-0.3.zip
mv vosk-model-small-en-us-0.3 model
python3 ./test_simple.py test.wav > result.json

The result will be stored in json format.
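To pull just the recognized text out of result.json, something like the sketch below might help. It assumes the file is a series of JSON objects written one after another (rather than one JSON document), with interim results under a "partial" key and finished utterances under a "text" key. That is my understanding of what test_simple.py prints, so check it against your own output first.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value from a string holding several concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not accept leading whitespace, so skip it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def final_texts(text):
    """Collect the non-empty "text" fields, ignoring "partial" results (assumed key names)."""
    return [
        obj["text"]
        for obj in iter_json_objects(text)
        if isinstance(obj, dict) and obj.get("text")
    ]
```

Usage would be along the lines of `print("\n".join(final_texts(open("result.json").read())))`.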

The same directory also contains an srt subtitle output example, which is easier to evaluate and can be directly useful to some users:

python3 -m pip install srt
python3 ./test_srt.py test.wav

The example in the repository consists of three sentences, spoken in a perfect American English accent with perfect sound quality, which I transcribe as:

one zero zero zero one
nine oh two one oh
zero one eight zero three

The "nine oh two one oh" is said very fast, but is still clear. The "z" of the second-to-last "zero" sounds a bit like an "s".

The SRT generated above reads:

1
00:00:00,870 --> 00:00:02,610
what zero zero zero one

2
00:00:03,930 --> 00:00:04,950
no no to uno

3
00:00:06,240 --> 00:00:08,010
cyril one eight zero three

So several mistakes were made, presumably in part because we, as listeners, have the advantage of knowing that all the words are numbers.

Next I also tried vosk-model-en-us-aspire-0.2, which is a 1.4GB download compared to the 36MB of vosk-model-small-en-us-0.3, and is listed at https://alphacephei.com/vosk/models:

mv model model.vosk-model-small-en-us-0.3
wget https://alphacephei.com/vosk/models/vosk-model-en-us-aspire-0.2.zip
unzip vosk-model-en-us-aspire-0.2.zip
mv vosk-model-en-us-aspire-0.2 model

and the result was:

1
00:00:00,840 --> 00:00:02,610
one zero zero zero one

2
00:00:04,026 --> 00:00:04,980
i know what you window

3
00:00:06,270 --> 00:00:07,980
serial one eight zero three

which got one more word correct.
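To put a number on the comparison, here is a rough sketch that counts word-level edit distance (substitutions, insertions and deletions) per subtitle line against my own transcription. Treating this as a word error count is my framing, not something Vosk reports, and the three lists are simply copied from the outputs in this answer.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two sentences, counted in whole words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def total_errors(reference_lines, hypothesis_lines):
    """Sum the per-line word edit distances."""
    return sum(word_edit_distance(r, h)
               for r, h in zip(reference_lines, hypothesis_lines))


REFERENCE = ["one zero zero zero one", "nine oh two one oh", "zero one eight zero three"]
SMALL = ["what zero zero zero one", "no no to uno", "cyril one eight zero three"]
ASPIRE = ["one zero zero zero one", "i know what you window", "serial one eight zero three"]

ref_words = sum(len(line.split()) for line in REFERENCE)
for name, hyp in (("small", SMALL), ("aspire", ASPIRE)):
    print(f"{name}: {total_errors(REFERENCE, hyp)}/{ref_words} word errors")
```

With the transcripts above this prints 7/15 word errors for the small model and 6/15 for the aspire model, which matches the "one more word correct" observation.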

Tested on vosk-api 7af3e9a334fbb9557f2a41b97ba77b9745e120b3.

15

I know this is old, but to expand on Nikolay's answer and hopefully save someone some time in the future: to get an up-to-date version of pocketsphinx working, you need to compile it from the GitHub or SourceForge repository (not sure which is kept more up to date). Note that -j8 means run 8 separate jobs in parallel if possible; if you have more CPU cores you can increase the number.

git clone https://github.com/cmusphinx/sphinxbase.git
cd sphinxbase
./autogen.sh
./configure
make -j8
make -j8 check
sudo make install
cd ..
git clone https://github.com/cmusphinx/pocketsphinx.git
cd pocketsphinx
./autogen.sh
./configure
make -j8
make -j8 check
sudo make install
cd ..

Then, from: https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/US%20English/ download the newest versions of cmusphinx-en-us-....tar.gz and en-70k-....lm.gz

tar -xzf cmusphinx-en-us-....tar.gz
gunzip en-70k-....lm.gz

Then you can finally proceed with the steps from Nikolay's answer:

ffmpeg -i book.mp3 -ar 16000 -ac 1 book.wav
pocketsphinx_continuous -infile book.wav \
    -hmm cmusphinx-en-us-8khz-5.2 -lm en-70k-0.2.lm \
    2>pocketsphinx.log >book.txt

Sphinx works alright. I wouldn't rely on it to make a readable version of the text, but it's good enough that you can search it if you're looking for a particular quote. That works especially well if you use a search tool like Recoll (http://www.lesbonscomptes.com/recoll/), which is built on Xapian, accepts wildcards and doesn't require exact search expressions.
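To illustrate the "search it even though it is not exact" idea without installing anything, here is a small standard-library sketch. It slides a window over the transcript and scores each window against the quote with `difflib`. This is only a stand-in for a tolerant search and is not how Recoll or Xapian work internally; the function name is made up for the example.

```python
import difflib


def best_fuzzy_match(transcript, quote):
    """Return (score, snippet) for the run of words most similar to the quote."""
    words = transcript.split()
    size = len(quote.split())
    if size == 0:
        return (0.0, "")
    target = quote.lower()
    best = (0.0, "")
    # Compare the quote against every window of the same number of words.
    for start in range(max(1, len(words) - size + 1)):
        snippet = " ".join(words[start:start + size])
        score = difflib.SequenceMatcher(None, target, snippet.lower()).ratio()
        if score > best[0]:
            best = (score, snippet)
    return best
```

For instance, `best_fuzzy_match(open("book.txt").read(), "some half remembered quote")` returns the closest passage along with a similarity score between 0 and 1. It is slow on a whole book, so treat it as a demonstration rather than a replacement for a real indexer.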

Hope this helps.

nickcrabtree
  • 1,213
12

If you are looking to convert speech to text you could try installing the julius package:

sudo apt install julius

Description:

"Julius" is a high-performance, two-pass large vocabulary continuous speech recognition (LVCSR) decoder software for speech-related researchers and developers.

Or another option that isn't in Ubuntu's repositories or in the Snap Store is Simon:

... is an open-source speech recognition program and replaces the mouse and keyboard.


Flimm
  • 44,031
CoalaWeb
  • 3,227
2

You can use Mozilla DeepSpeech, an open-source speech-to-text tool, but you will need to train it or download Mozilla's pre-trained model. For my project the accuracy was still not sufficient because the audio files were not of good quality, so I used Transcribear instead, a web-based editor with speech-to-text capabilities. Note that you need to be online to upload recordings to the Transcribear server.

John
  • 81
2

vosk-transcriber official CLI from Vosk

I was randomly tab completing after installing Vosk today (previously mentioned at: https://askubuntu.com/a/423849/52975) when I saw that they had at last added a nice CLI wrapper. Tested on Ubuntu 23.10, you can install it with the English model as:

pipx install vosk
mkdir -p ~/var/lib/vosk
cd ~/var/lib/vosk
wget https://alphacephei.com/vosk/models/vosk-model-en-us-0.22.zip
unzip vosk-model-en-us-0.22.zip
cd -

and then use as:

wget -O think.ogg https://upload.wikimedia.org/wikipedia/commons/4/49/Think_Thomas_J_Watson_Sr.ogg
vosk-transcriber -m ~/var/lib/vosk/vosk-model-en-us-0.22 -i think.ogg -o think.srt -t srt

-i will eat pretty much anything including compressed audio files like .ogg or even video files like .ogv, presumably FFmpeg at work.

Nice! Now they only need a vosk-transcriber --download-model en option and a default -m directory to finally make things fully clean, but this is already a huge quality-of-life improvement.

I played with a few examples to informally evaluate accuracy at: https://unix.stackexchange.com/questions/256138/is-there-any-decent-speech-recognition-software-for-linux/613392#613392

OpenAI Whisper

https://github.com/openai/whisper

Tested on Ubuntu 24.04, install:

sudo apt install ffmpeg
pipx install openai-whisper==20231117

Sample usage:

wget https://upload.wikimedia.org/wikipedia/commons/f/f6/Appuru.wav
time whisper Appuru.wav

Terminal output with this perfectly clean en-US demo: https://commons.wikimedia.org/wiki/File:Appuru.wav

/home/ciro/.local/pipx/venvs/openai-whisper/lib/python3.12/site-packages/whisper/transcribe.py:115: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:03.000]  The apple does not fall far from the tree.

real    0m7.516s
user    0m31.209s
sys     0m4.194s

and cwd now contains several output files such as Appuru.srt:

1
00:00:00,000 --> 00:00:03,000
The apple does not fall far from the tree.

so it worked perfectly.
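If you then want to post-process the .srt file, e.g. to get plain text or the timings, a minimal standard-library parser could look like the sketch below. It assumes the simple index / timing / text layout shown above and ignores any fancier SubRip features. The `srt` package from PyPI is probably the better choice for real work.

```python
import re

TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def to_seconds(stamp):
    """Convert an SRT timestamp such as 00:00:03,000 to seconds."""
    h, m, s, ms = (int(x) for x in TIMESTAMP.fullmatch(stamp.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000


def parse_srt(text):
    """Return a list of (start_seconds, end_seconds, caption) tuples."""
    cues = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 2 or "-->" not in lines[1]:
            continue  # not an index / timing / text block, skip it
        start, end = lines[1].split("-->")
        caption = " ".join(lines[2:]).strip()
        cues.append((to_seconds(start), to_seconds(end), caption))
    return cues
```

The plain transcript is then `" ".join(caption for _, _, caption in parse_srt(open("Appuru.srt").read()))`.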

I also did a longer benchmark with a video at "Automatically generate subtitles/close caption from a video using speech-to-text?", and it worked amazingly!

https://unix.stackexchange.com/questions/256138/is-there-any-decent-speech-recognition-software-for-linux/718354#718354 by Franck Dernoncourt reports that, to make use of an Nvidia 3090 GPU, you should add the following after conda activate whisperpy39:

pip install -f https://download.pytorch.org/whl/torch_stable.html
conda install pytorch==1.10.1 torchvision torchaudio cudatoolkit=11.0 -c pytorch

OpenAI Whisper-based tools

A list: https://github.com/sindresorhus/awesome-whisper#cli-tools, learned from "Automatically generate subtitles/close caption from a video using speech-to-text?". Perhaps some of those will make the model a bit easier to use.

Speech Note

https://github.com/mkiol/dsnote

This project is a front-end for a number of backend TTS and STT models in multiple languages. Install and launch:

flatpak install flathub net.mkiol.SpeechNote
flatpak run net.mkiol.SpeechNote

This opens a GUI.

Then under:

  • Languages
  • English
  • Speech to Text

I can download a model.

They have both Whisper and Vosk and a few others.

Then you can either:

  • Click "Listen" to take voice input from the microphone
  • File > Import from a file to select a sound file containing the speech

and the recognized text will appear in the text box.

CLI-only usage is limited unfortunately: https://github.com/mkiol/dsnote/issues/83

Tested on Speech Note 4.7.0, Ubuntu 24.10.

Benchmarks

https://github.com/Picovoice/speech-to-text-benchmark mentions a few: