How can I get speech output from entered text using the command line?
I would also like to be able to change the speech rate, pitch, volume, etc. with a simple command.
In order of descending popularity:
say converts text to audible speech using the GNUstep speech engine.
sudo apt-get install gnustep-gui-runtime
say "hello"
festival is a general multi-lingual speech synthesis system.
sudo apt-get install festival
echo "hello" | festival --tts
spd-say sends text-to-speech output request to speech-dispatcher
sudo apt-get install speech-dispatcher
spd-say "hello"
espeak is a multi-lingual software speech synthesizer.
sudo apt-get install espeak
espeak "hello"
espeak is a nice little tool.
I just like playing around with it in a command line. You might find it conflicts with Pulseaudio so I'm using a long-winded version that negates having to set it up properly.
sudo apt-get install espeak
espeak --stdout "this is a test" | paplay
espeak --help will show you the options to calibrate reading speed, pitch, voice, etc.
When you're doing your notes, save them as a text file and then:
echo "these are my notes" > text.txt
espeak --stdout -f text.txt > text.wav
paplay text.wav # you should hear "these are my notes"
You can then play around with ffmpeg et al. to compress this down from PCM to something more manageable like MP3 or OGG. But that's a different story.
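A minimal sketch of that compression step, assuming an ffmpeg build that includes a Vorbis or MP3 encoder (ffmpeg normally picks the codec from the output file extension). The helper names out_name and compress_wav are made up for illustration.

```shell
# out_name FILE EXT: replace a trailing ".wav" with the given extension.
out_name() {
    printf '%s.%s\n' "${1%.wav}" "$2"
}

# compress_wav FILE EXT: re-encode FILE next to the original.
# Assumption: ffmpeg is installed and can encode to the requested format.
compress_wav() {
    ffmpeg -i "$1" "$(out_name "$1" "$2")"
}
```

For example, `compress_wav text.wav ogg` should leave a `text.ogg` next to `text.wav`.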
From man spd-say:
NAME
spd-say - send text-to-speech output request to speech-dispatcher
SYNOPSIS
spd-say [options] "some text"
DESCRIPTION
spd-say sends text-to-speech output request to speech-dispatcher process which handles it and ideally outputs the result
to the audio system.
OPTIONS
-r, --rate
Set the rate of the speech (between -100 and +100, default: 0)
-p, --pitch
Set the pitch of the speech (between -100 and +100, default: 0)
-i, --volume
Set the volume (intensity) of the speech (between -100 and +100, default: 0)
Hence you can get text-to-speech with the following command:
spd-say "<type text>"
Ex:
spd-say "Welcome to Ubuntu Linux"
You can also set the speech rate, pitch, volume, etc.; see the man page.
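As a rough sketch of how the -r, -p and -i options from the man page could be combined, the wrapper below clamps each value to the documented -100..+100 range before calling spd-say. The function names clamp and speak are hypothetical; only the three flags come from the man page excerpt.

```shell
# clamp VALUE: limit an integer to the -100..100 range documented in man spd-say.
clamp() {
    if [ "$1" -lt -100 ]; then
        printf '%s\n' -100
    elif [ "$1" -gt 100 ]; then
        printf '%s\n' 100
    else
        printf '%s\n' "$1"
    fi
}

# speak TEXT [RATE] [PITCH] [VOLUME]: hypothetical wrapper around spd-say.
# Omitted values fall back to 0, the default given in the man page.
speak() {
    spd-say -r "$(clamp "${2:-0}")" \
            -p "$(clamp "${3:-0}")" \
            -i "$(clamp "${4:-0}")" \
            "$1"
}
```

For example, `speak "Welcome to Ubuntu Linux" 30 -20 50` would read the sentence a little faster, slightly lower and louder than the defaults.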
Python Google Speech :
pip install google_speech
google_speech "Test the hello world"
Svox From Android :
apt-get install svox-pico
pico2wave --wave=test.wav "Test the hello world"
play test.wav
Svox Nanotts :
git clone https://github.com/gmn/nanotts.git
cd nanotts
make
./nanotts -v en-US "Test the hello world"
Linked resource: Comparison of speech synthesizers
Post source: Linuxhacks.org
Disclosure: I am the owner of Linuxhacks.org
Mbrola doesn't work since 11.10.
The SVOX (pico) tools are easy to install, easy to use, and bring good-quality voices to Ubuntu. Install them:
sudo apt-get install libttspico0 libttspico-utils libttspico-data
Even easier, you can use LibreOffice in combination with the SVOX (pico) tools by installing the "Read Text" extension, which gives you a "GUI" for this excellent TTS software:
Set up Read Text Extension's options with Tools - Add-ons - Read selection.... Use /usr/bin/python as the external program. Select a command line option that includes the token (PICO_READ_TEXT_PY).
That's what I use. And it sounds natural, it's easy to understand and it recognizes units (m, °C,kg, ...).
Here is my first post about pico2wave.
All you have to do is: Go to Ubuntu Software Center and search for "pico". You'll find 4 or 5 entries with "Small Footprint Ling...". Install them.
A possible use of pico2wave is described in my first posting (follow the link above).
I think what we need at this point is the big summary table, notably looking out for any tool that sounds remotely natural given our 2024-ongoing "deep learning revolution" (the problem now with this Cambrian Explosion is that the packages break every week and only work on certain systems).
| Tool | Sounds remotely natural | Output to file | Multilingual | Tested on |
|---|---|---|---|---|
| pico2wave (libttspico-utils 1.0+git20130326-14) | y. Some weird distortions, but reasonable. | y | -l fr-FR | 24.04 |
| idiap/coqui-ai-TTS 0.24.1 + Tacotron2 | y. Output is randomly different each time. Most words are awesome. Punctuation timing is off. Sometimes it goes completely crazy and it is hilarious. | --out_path tmp.wav | | 24.04 |
| Speech Note 4.7.0 + Mimic3 Arctic Aew Low | y | from GUI only | y | 24.10 |
| Speech Note 4.7.0 + Piper Amy Low Female | y | from GUI only | y | 24.10 |
| festival 2.5.0 + festvox-us-slt-hts 2010.10.25 | y. Not amazing, but OK. Slight voice distortion and punctuation off. | n | --language english | 24.04 |
| spd-say (speech-dispatcher 0.12.0) | n | n | -l fr | 24.04 |
| say (gnustep-gui-runtime 0.30.0) | n | n | n | 24.04 |
| espeak 1.48.15 | n | --stdout > tmp.wav | -v fr | 24.04 |
| festival 2.5.0 | n | n | --language english | 24.04 |
| svox nanotts d8b91f3 | n | | | 24.04 |
| espeak-ng 1.51 | n | | | 24.04 |
| piper | | | | 24.04 |
| tortoise-tts 3.0.0 | | | | 24.04 |
Empty cell means "unknown, untested".
My quick test strings are:
en: "Hello, my name is John Smith. What is your name?"
fr: "Bonjour, je m'appelle Jean Jacques. Tu t'appelles comment?"
"Remotely natural" is of course extremely subjective, and will suffer from the continual moving of AI goalposts as things evolve and we get used to better systems. For now, maybe I'd consider it something along the lines of "good enough for an informal video voiceover".
Previously mentioned at: https://askubuntu.com/a/1466489/52975
At https://github.com/MycroftAI/mimic3/issues/83#issuecomment-2740023510 a maintainer said that Mimic 3 is not maintained anymore.
On Ubuntu 24.04 in a clean virtualenv running:
pip install piper-tts
fails with:
ERROR: Cannot install piper-tts==1.1.0 and piper-tts==1.2.0 because these package versions have conflicting dependencies.
bug report: https://github.com/rhasspy/piper/issues/509
On Ubuntu 24.04:
sudo apt install libttspico-utils
pico2wave -w tmp.wav "Hello, my name is John Smith. What is your name?"
ffplay -autoexit tmp.wav
https://github.com/idiap/coqui-ai-TTS
pipx install coqui-tts
tts --text "Hello, my name is John Smith. What is your name?" --pipe_out | aplay
The first time you call it it installs the necessary model automatically.
Sound takes 5-10 s to start coming out on each invocation, which is unacceptable for frequent short sentences.
The default model seems to be Tacotron2: https://github.com/NVIDIA/tacotron2 but you can select other models from CLI.
Previously mentioned at: https://askubuntu.com/a/1447599/52975
coqui-ai/TTS does not support Python 3.12 (Ubuntu 24.04): pip install TTS fails. Report: https://github.com/coqui-ai/TTS/issues/3257 A collaborator says at https://github.com/coqui-ai/TTS/issues/3257#issuecomment-2096792618 to use idiap/coqui-ai-TTS instead.
Based on the README similarity it seems to be a fork of https://github.com/mozilla/TTS
Mentioned at: https://askubuntu.com/a/908889/52975 tested on Ubuntu 24.04:
sudo apt install festvox-us-slt-hts
festival -b '(voice_cmu_us_slt_arctic_hts)' '(SayText "Hello, my name is John Smith. What is your name?")'
https://github.com/neonbjb/tortoise-tts
On Ubuntu 24.04:
virtualenv -p python3 .venv
. .venv/bin/activate
pip install tortoise-tts==3.0.0
fails with:
ERROR: Failed building wheel for tokenizers
Bug report: https://github.com/neonbjb/tortoise-tts/issues/728
Speech Note supports it and it worked there.
Previously mentioned at: https://askubuntu.com/a/1447599/52975
On Ubuntu 24.10 I tried:
sudo apt-get install libespeak-ng1
pipx install 'mycroft-mimic3-tts[all]'
but that failed with:
Fatal error from pip prevented installation. Full pip output in file:
/home/ciro/.local/pipx/logs/cmd_2025-03-20_07.57.51_pip_errors.log
pip failed to build package:
libwapiti
Some possibly relevant errors from pip install:
error: subprocess-exited-with-error
libwapiti/src/api.c:157:36: error: passing argument 4 of ‘tag_nbviterbi’ from incompatible pointer type [-Wincompatible-pointer-types]
libwapiti/src/api.c:157:46: error: passing argument 6 of ‘tag_nbviterbi’ from incompatible pointer type [-Wincompatible-pointer-types]
error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (libwapiti)
Error installing mycroft-mimic3-tts from spec 'mycroft-mimic3-tts[all]'.
Bug report: https://github.com/MycroftAI/mimic3/issues/83
Speech Note supports it and it worked there.
https://github.com/mkiol/dsnote
This project is a front-end for a number of backend TTS and STT models in multiple languages. That is cool because with it you can quickly try several models on a given text to decide which one is better, without having to install a bunch of differently broken software systems. Trying:
flatpak install flathub net.mkiol.SpeechNote
flatpak run net.mkiol.SpeechNote
opens a GUI.
Then under:
I can download a model. Just note that some of them require "voice samples" presumably to clone from, which might or might not be what you want.
Then you can type your text in the GUI and click the "Read" button to hear it.
And to save to a file:
Now for CLI-only attempt:
flatpak run net.mkiol.SpeechNote --print-available-models tts
lists models I've downloaded:
en_coqui_fairseq_eng "English (Coqui MMS) / en"
en_piper_us_amy_low "English (Piper Amy Low Female) / en"
en_rhvoice_alan "English (RHVoice Alan Male) / en"
en_whisperspeech_q4_base_enpl "English (WhisperSpeech Base) / en"
but TODO also opens the GUI. And finally TODO CLI-only TTS? This is a starting point:
flatpak run net.mkiol.SpeechNote \
--id en_coqui_fairseq_eng \
--text 'Hello, my name is John Smith. What is your name?'
After changing a setting under:
I can also make it speak with:
flatpak run net.mkiol.SpeechNote \
--id en_coqui_fairseq_eng \
--text 'Hello, my name is John Smith. What is your name?' \
--action start-reading-text
but I don't know how to save to file from the CLI.
Related: https://github.com/mkiol/dsnote/issues/83
Tested on Speech Note 4.7.0, Ubuntu 24.10.
No easy CLI instructions:
Bibliography:
And yet another espeak gui: gespeaker. It uses both espeak and mbrola engines. Also, it has more options than espeak-gui.
The following is not a FLOSS solution, but you may find it worthwhile (it is a Wine solution).
I'm personally very keen on TTS and use it quite often, e.g. for listening to a rambling discourse that I would never bother to stick with otherwise (because I need to get another cup of coffee... :)
A few things I've discovered along the way, or should I say, things I haven't discovered along the way. To put it bluntly: every piece of FOSS TTS voice software I've tried is under par and therefore unsuitable for any semi-protracted listening.
I currently use AT&T's NaturalVoices. It is only available for Windows (maybe the Mac), but it does run under Wine in Ubuntu. It has a minor glitch, where I sometimes need to click on the panel when I move away from the reader, but that is a minor issue compared to the advantage gained from the quality of speech of NaturalVoices.
Some other things I've found to be virtually essential for a half-sensible listening experience are:
These TTS programs are not intelligent (well, maybe as intelligent as a young baboon), so they need every bit of help they can get, and there is one (and only one) reader program I've found that helps greatly with this. The app is called ReadPlease (2003 Pro). It allows you to specially modify words and groups of words to be pronounced as you want them. It is by no means perfect, but for me it made the difference between the entire process being usable and not usable.
The speech in Natural Voices is "okay", but it is a bit boring. There are other good products too, but they are all for Windows, unfortunately.
It inflects surprisingly well sometimes, but OMG, initially it is a pain! So #2 is *patience*, and lots of updating of your "special words" list. By patience, I mean that I actually became accustomed to my particular baboon's speech patterns :) By the way, I currently have about 3000 words that now sound "human" enough that I no longer cringe when I hear them.
3. "Follow the Bouncing Ball". Again, because the voice is never as good as a real speaker, things sometimes need to be clarified. The reader program I use has one feature for which I even put up with its clunky-looking interface: it has a "select the word currently being read" option. Many readers have this, but ReadPlease keeps the current line bang in the center of the screen. This is invaluable for seeing ahead and behind to quickly re-read what you just missed (so auto-centering the current line is good).
Well, that's my experience. I'm going to make a coffee now, and while I'm doing it, I'll be listening to this to see how it "reads". TTS is surprisingly good for picking up typos (I make lots of typos).
If something as good as AT&T NaturalVoices turns up in the Ubuntu repositories, I'll jump at it.
Here is a link to some samples of Natural Voices: I use "Mike"
For festival (the voice seems more natural to me):
sudo apt-get install festival
echo "hello" | festival --tts
Pitch and speed configuration:
create ~/.festivalrc with the following content:
(Parameter.set 'Audio_Command "play -b 16 -c 1 -e signed-integer -r $SR -t raw $FILE tempo 1.5 pitch -100")
(Parameter.set 'Audio_Method 'Audio_Command)
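To experiment with other values, a small helper can print those two lines for a chosen tempo and pitch. This is only a sketch: the name festivalrc_lines is made up, and it assumes the sox meanings of the two effects (tempo is a speed factor, pitch is a shift in cents).

```shell
# festivalrc_lines TEMPO PITCH: print the two ~/.festivalrc lines shown above
# with the given sox tempo factor and pitch shift.
festivalrc_lines() {
    tempo=$1
    pitch=$2
    # \$SR and \$FILE are escaped so that festival, not the shell, fills them in.
    cat <<EOF
(Parameter.set 'Audio_Command "play -b 16 -c 1 -e signed-integer -r \$SR -t raw \$FILE tempo $tempo pitch $pitch")
(Parameter.set 'Audio_Method 'Audio_Command)
EOF
}
```

For example, `festivalrc_lines 1.2 50 > ~/.festivalrc` would write a slightly faster, slightly higher configuration. Note that this overwrites any existing ~/.festivalrc.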
See also http://www.solomonson.com/content/ubuntu-linux-text-speech
Update: I tried this on another Ubuntu computer and had to install an English voice package for festival to work properly:
sudo apt-get install festvox-kallpc16k
Also, play is a CLI command that comes with the sox package:
sudo apt-get install sox
Meet espeak-ng - A multi-lingual software speech synthesizer:
espeak-ng "text to read"
espeak-ng -f "$HOME/file to read" # a quoted ~ would not be expanded by the shell
It uses a default English voice, but there are numerous other voices for other languages and even dialects available and can be listed with espeak-ng --voices (for all) or e.g. espeak-ng --voices=en (for English). They can be set with -v together with either the language abbreviation or the file name, e.g. for Scottish or Swahili:
espeak-ng -v en-gb-scotland "text to read" # language name
espeak-ng -v bnt/sw "text to read" # file name: “bnt” for Bantu, “sw” for Swahili
There are many other options available, e.g. -s for the speed and -w to write the output to a wave file, see the manpage linked below.
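A minimal sketch that combines -v, -s, -f and -w in one helper. The function name read_file and the TTS_CMD variable are made up for illustration; setting TTS_CMD=echo gives a dry run that only prints the arguments that would be passed.

```shell
# Command to run; override with TTS_CMD=echo for a dry run.
TTS_CMD=${TTS_CMD:-espeak-ng}

# read_file VOICE SPEED FILE [WAV]: read FILE aloud with the given voice and
# speed (words per minute), or write a wave file if a 4th argument is given.
read_file() {
    if [ -n "${4:-}" ]; then
        "$TTS_CMD" -v "$1" -s "$2" -f "$3" -w "$4"
    else
        "$TTS_CMD" -v "$1" -s "$2" -f "$3"
    fi
}
```

For example, `read_file en-gb-scotland 140 notes.txt` reads the file aloud, while `read_file en-gb-scotland 140 notes.txt notes.wav` saves the speech to notes.wav instead.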
espeak-ng (“ng” for “next generation”) is an actively developed fork of the original espeak speech synthesizer software, see the History chapter on Wikipedia. Both are available from the official sources via the package espeak or espeak-ng respectively.
Update for 2023. Pico2wave is a very lightweight utility, however these two are very natural sounding:
Audio comparison of free Linux TTS 2022. - YouTube
Even though you've already accepted an answer, I wanted to mention festival, which I like quite a lot too. This post on the Ubuntu forums has a lot of information on getting very nice voices set up for it.
The tool gTTS is great for generating audio files from text. It uses Google Translate's text-to-speech API and generates MP3 files.
Given that it uses pip for installation, I strongly recommend you install Miniconda, and then use conda to create an environment where you can install gTTS. You can download Miniconda from here.
gTTS GitHub repository and documentation.
Piper is a fast, local neural text-to-speech system. Check the project site for installation, voice downloads, and usage. For example:
echo 'Welcome to the world of speech synthesis!' | \
./piper --model blizzard_lessac-medium.onnx --output_file welcome.wav
Balabolka under Wine works fine (for me) with SAPI4 voices (SAPI5 voices are not detected on my Linux system). It can open files and start reading.
Here is a link to Wine's AppDB entry for Balabolka.