Does anyone know of an application that can convert audio to text?
5 Answers
One tool you can use is Vosk-api, a modern speech recognition toolkit based on neural networks. It supports 7+ languages and works on a variety of platforms, including the Raspberry Pi and mobile devices.
First you convert the file to the required format (16 kHz mono WAV), then you recognize it:
ffmpeg -i file.mp3 -ar 16000 -ac 1 file.wav
Then install vosk-api with pip:
pip3 install vosk
Then use these steps:
git clone https://github.com/alphacep/vosk-api
cd vosk-api/python/example
wget https://alphacephei.com/kaldi/models/vosk-model-small-en-us-0.3.zip
unzip vosk-model-small-en-us-0.3.zip
mv vosk-model-small-en-us-0.3 model
python3 ./test_simple.py test.wav > result.json
The result will be stored in json format.
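Since test_simple.py prints one JSON object per recognized chunk, the redirected result.json is really a stream of JSON objects rather than a single document. A minimal stdlib sketch to join the recognized text back together, assuming each chunk carries a "text" field as in the Vosk examples:

```python
import json

# result.json as produced above: one JSON object per recognized chunk, e.g.
# {"text": "one zero zero zero one"}
def collect_text(raw):
    """Join the "text" fields of a stream of concatenated JSON objects."""
    decoder = json.JSONDecoder()
    texts, pos = [], 0
    while pos < len(raw):
        obj, end = decoder.raw_decode(raw, pos)
        if obj.get("text"):
            texts.append(obj["text"])
        pos = end
        # skip whitespace between objects
        while pos < len(raw) and raw[pos].isspace():
            pos += 1
    return " ".join(texts)

sample = '{"text": "one zero zero zero one"}\n{"text": "nine oh two one oh"}'
print(collect_text(sample))  # one zero zero zero one nine oh two one oh
```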
The same directory also contains an srt subtitle output example, which is easier to evaluate and can be directly useful to some users:
python3 -m pip install srt
python3 ./test_srt.py test.wav
The example audio shipped in the repository consists of three sentences, spoken in a clear American English accent with perfect sound quality, which I transcribe as:
one zero zero zero one
nine oh two one oh
zero one eight zero three
The "nine oh two one oh" is said very fast, but still clearly. The "z" of the second-to-last "zero" sounds a bit like an "s".
The SRT generated above reads:
1
00:00:00,870 --> 00:00:02,610
what zero zero zero one
2
00:00:03,930 --> 00:00:04,950
no no to uno
3
00:00:06,240 --> 00:00:08,010
cyril one eight zero three
so we can see that several mistakes were made, presumably in part because, as human listeners, we benefit from knowing that all the words are numbers.
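To put a number on the mistakes, the usual metric is word error rate (WER): the word-level edit distance between the reference and the hypothesis, divided by the reference length. A small stdlib sketch, applied to the transcripts above:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# my transcription vs. the small model's SRT output above
ref = "one zero zero zero one nine oh two one oh zero one eight zero three"
hyp = "what zero zero zero one no no to uno cyril one eight zero three"
print(round(wer(ref, hyp), 2))
```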
Next I also tried vosk-model-en-us-aspire-0.2, a 1.4 GB download compared to the 36 MB of vosk-model-small-en-us-0.3, listed at https://alphacephei.com/vosk/models:
mv model model.vosk-model-small-en-us-0.3
wget https://alphacephei.com/vosk/models/vosk-model-en-us-aspire-0.2.zip
unzip vosk-model-en-us-aspire-0.2.zip
mv vosk-model-en-us-aspire-0.2 model
and the result was:
1
00:00:00,840 --> 00:00:02,610
one zero zero zero one
2
00:00:04,026 --> 00:00:04,980
i know what you window
3
00:00:06,270 --> 00:00:07,980
serial one eight zero three
which got one more word correct.
Tested on vosk-api 7af3e9a334fbb9557f2a41b97ba77b9745e120b3.
I know this is old, but to expand on Nikolay's answer and hopefully save someone some time in the future: to get an up-to-date version of pocketsphinx working, you need to compile it from the GitHub or SourceForge repository (I'm not sure which is kept more up to date). Note that -j8 means run up to 8 jobs in parallel; if you have more CPU cores, you can increase the number.
git clone https://github.com/cmusphinx/sphinxbase.git
cd sphinxbase
./autogen.sh
./configure
make -j8
make -j8 check
sudo make install
cd ..
git clone https://github.com/cmusphinx/pocketsphinx.git
cd pocketsphinx
./autogen.sh
./configure
make -j8
make -j8 check
sudo make install
cd ..
Then, from: https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/US%20English/
download the newest versions of cmusphinx-en-us-....tar.gz and en-70k-....lm.gz
tar -xzf cmusphinx-en-us-....tar.gz
gunzip en-70k-....lm.gz
Then you can finally proceed with the steps from Nikolay's answer:
ffmpeg -i book.mp3 -ar 16000 -ac 1 book.wav
pocketsphinx_continuous -infile book.wav \
-hmm cmusphinx-en-us-8khz-5.2 -lm en-70k-0.2.lm \
2>pocketsphinx.log >book.txt
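If you'd rather drive this from Python, the same invocation can be scripted with subprocess. A sketch that builds the argument list from the file names above and only runs it when pocketsphinx_continuous is actually on the PATH:

```python
import shutil
import subprocess

def build_cmd(wav, hmm, lm):
    """Argument list matching the pocketsphinx_continuous call above."""
    return ["pocketsphinx_continuous", "-infile", wav, "-hmm", hmm, "-lm", lm]

cmd = build_cmd("book.wav", "cmusphinx-en-us-8khz-5.2", "en-70k-0.2.lm")
if shutil.which("pocketsphinx_continuous"):
    # redirect stdout to the transcript and stderr to the log, as above
    with open("book.txt", "w") as out, open("pocketsphinx.log", "w") as log:
        subprocess.run(cmd, stdout=out, stderr=log, check=True)
else:
    print("pocketsphinx_continuous not installed")
```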
Sphinx works alright. I wouldn't rely on it to produce a readable version of the text, but it's good enough to search through if you're looking for a particular quote. That works especially well with a search tool like Recoll (http://www.lesbonscomptes.com/recoll/), which is built on Xapian, accepts wildcards, and doesn't require exact search expressions.
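The wildcard idea can be sketched in a few lines of stdlib Python, matching glob-style patterns line by line against the transcript (Recoll itself builds a proper index and is far more capable; this just illustrates why inexact matching helps on noisy transcripts):

```python
import fnmatch

def grep_transcript(text, pattern):
    """Return transcript lines matching a case-insensitive glob pattern."""
    return [line for line in text.splitlines()
            if fnmatch.fnmatchcase(line.lower(), pattern.lower())]

transcript = """the apple does not fall far from the tree
serial one eight zero three"""
print(grep_transcript(transcript, "*fall*from*"))
# ['the apple does not fall far from the tree']
```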
Hope this helps.
If you are looking to convert speech to text, you could try installing the julius package:
sudo apt install julius
Description:
"Julius" is a high-performance, two-pass large vocabulary continuous speech recognition (LVCSR) decoder software for speech-related researchers and developers.
Or another option that isn't in Ubuntu's repositories or in the Snap Store is Simon:
... is an open-source speech recognition program and replaces the mouse and keyboard.
You can use Mozilla DeepSpeech, an open-source speech-to-text tool. You will need to either train a model yourself or download Mozilla's pre-trained one. For my project the accuracy was still not sufficient, as the audio files were of poor quality, so I used Transcribear instead, a web-based editor with speech-to-text capabilities; note that you need to be online to upload recordings to the Transcribear server.
vosk-transcriber, the official CLI from Vosk
I was randomly tab completing after installing Vosk today (previously mentioned at: https://askubuntu.com/a/423849/52975) when I saw that they had at last added a nice CLI wrapper. Tested on Ubuntu 23.10, you can install it together with the English model as:
pipx install vosk
mkdir -p ~/var/lib/vosk
cd ~/var/lib/vosk
wget https://alphacephei.com/vosk/models/vosk-model-en-us-0.22.zip
unzip vosk-model-en-us-0.22.zip
cd -
and then use as:
wget -O think.ogg https://upload.wikimedia.org/wikipedia/commons/4/49/Think_Thomas_J_Watson_Sr.ogg
vosk-transcriber -m ~/var/lib/vosk/vosk-model-en-us-0.22 -i think.ogg -o think.srt -t srt
-i will accept pretty much anything, including compressed audio files like .ogg and even video files like .ogv; presumably that's FFmpeg at work.
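The resulting .srt is easy to post-process. A small stdlib sketch that parses the standard SRT layout (index line, HH:MM:SS,mmm --> HH:MM:SS,mmm timing line, text, blank line between cues) into (start, end, text) tuples:

```python
import re

TIME = r"(\d+):(\d+):(\d+),(\d+)"

def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def parse_srt(srt):
    """Return (start_seconds, end_seconds, text) for each subtitle cue."""
    cues = []
    for block in srt.strip().split("\n\n"):
        lines = block.splitlines()
        m = re.match(TIME + r" --> " + TIME, lines[1])
        start = to_seconds(*m.groups()[:4])
        end = to_seconds(*m.groups()[4:])
        cues.append((start, end, " ".join(lines[2:])))
    return cues

sample = "1\n00:00:00,870 --> 00:00:02,610\nwhat zero zero zero one\n"
print(parse_srt(sample))  # [(0.87, 2.61, 'what zero zero zero one')]
```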
Nice! Now they only need a vosk-transcriber --download-model en option and a default -m directory to make things fully clean, but this is already a huge quality-of-life improvement.
I played with a few examples to informally evaluate accuracy at: https://unix.stackexchange.com/questions/256138/is-there-any-decent-speech-recognition-software-for-linux/613392#613392
OpenAI Whisper
https://github.com/openai/whisper
Tested on Ubuntu 24.04, install:
sudo apt install ffmpeg
pipx install openai-whisper==20231117
Sample usage:
wget https://upload.wikimedia.org/wikipedia/commons/f/f6/Appuru.wav
time whisper Appuru.wav
Terminal output with this perfectly clean en-US demo: https://commons.wikimedia.org/wiki/File:Appuru.wav
/home/ciro/.local/pipx/venvs/openai-whisper/lib/python3.12/site-packages/whisper/transcribe.py:115: UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:03.000] The apple does not fall far from the tree.
real 0m7.516s
user 0m31.209s
sys 0m4.194s
and cwd now contains several output files such as Appuru.srt:
1
00:00:00,000 --> 00:00:03,000
The apple does not fall far from the tree.
so it worked perfectly.
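Whisper's terminal output can also be scraped directly if you capture it instead of using the generated .srt. A sketch with a short regex, assuming the [MM:SS.mmm --> MM:SS.mmm] segment format shown above:

```python
import re

SEG = re.compile(r"\[(\d+):(\d+\.\d+) --> (\d+):(\d+\.\d+)\]\s*(.*)")

def parse_segments(output):
    """Extract (start_s, end_s, text) from Whisper's terminal segment lines."""
    segs = []
    for line in output.splitlines():
        m = SEG.match(line)
        if m:
            sm, ss, em, es, text = m.groups()
            segs.append((int(sm) * 60 + float(ss),
                         int(em) * 60 + float(es), text))
    return segs

out = "[00:00.000 --> 00:03.000] The apple does not fall far from the tree."
print(parse_segments(out))
# [(0.0, 3.0, 'The apple does not fall far from the tree.')]
```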
Here I did a longer benchmark with a video: Automatically generate subtitles/close caption from a video using speech-to-text? and it worked amazingly!
https://unix.stackexchange.com/questions/256138/is-there-any-decent-speech-recognition-software-for-linux/718354#718354 by Franck Dernoncourt reports that, to make use of an Nvidia 3090 GPU, you should add the following after conda activate whisperpy39:
pip install -f https://download.pytorch.org/whl/torch_stable.html
conda install pytorch==1.10.1 torchvision torchaudio cudatoolkit=11.0 -c pytorch
OpenAI Whisper-based tools
A list: https://github.com/sindresorhus/awesome-whisper#cli-tools learned from: Automatically generate subtitles/close caption from a video using speech-to-text? Perhaps some of those will make the model a bit easier to use.
Speech Note
https://github.com/mkiol/dsnote
This project is a front-end for a number of backend TTS and STT models in multiple languages. Install and launch:
flatpak install flathub net.mkiol.SpeechNote
flatpak run net.mkiol.SpeechNote
opens a GUI.
Then under:
- Languages
- English
- Text to Speech
I can download a model.
They have both Whisper and Vosk and a few others.
Then you can either:
- Click "Listen" to take voice input from the microphone
- File > Import from a file to select a sound file containing the speech
and the recognized text will appear in the text box.
CLI-only usage is limited unfortunately: https://github.com/mkiol/dsnote/issues/83
Tested on Speech Note 4.7.0, Ubuntu 24.10.
Benchmarks
https://github.com/Picovoice/speech-to-text-benchmark mentions a few:
- LibriSpeech. This one is also part of MLPerf v3.1.
- TED-LIUM
- Common Voice