1

I tried a couple of different websites and libraries. I also found this topic from 3.5 years ago: "What are the current open source text-to-audio libraries?"

It looks like nobody has published anything in the last couple of years, and most solutions are really not good. Even Amazon sounds like some weird robot.

One of the best I tried is Coqui, but their best models were never published to GitHub. So TTS on their website sounds perfect, but you have to pay.

Also, ElevenLabs is amazing, but it is not open-source.

I cannot believe that there are no published models that sound good. I need to generate speech for a huge amount of text, so paying for subscriptions will be very expensive. They all charge about $5 per 20-40 minutes of speech, which is a bit too much for me.
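To put rough numbers on that, here is a minimal back-of-the-envelope sketch. The $5 per 20-40 minutes figure is the one quoted above; the speaking rate of 150 words per minute and the one-million-word corpus are assumptions picked only for illustration, and `estimate_cost` is a hypothetical helper name.

```python
def estimate_cost(word_count, words_per_minute=150,
                  price_usd=5.0, minutes_low=20, minutes_high=40):
    """Return (cheapest, priciest) cost in USD for narrating word_count words.

    words_per_minute is an assumed average speaking rate, not a measured one.
    price_usd buys between minutes_low and minutes_high minutes of speech.
    """
    minutes_of_speech = word_count / words_per_minute
    cheapest = minutes_of_speech / minutes_high * price_usd
    priciest = minutes_of_speech / minutes_low * price_usd
    return cheapest, priciest


# Example: one million words (an assumed corpus size).
low, high = estimate_cost(1_000_000)
print(f"${low:,.2f} to ${high:,.2f}")  # $833.33 to $1,666.67
```

Under those assumptions, a million words comes to roughly 111 hours of audio, which is why per-minute pricing adds up quickly.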

Could you recommend anything I can use for free? Preferably open-source, but not mandatory.

desertnaut

4 Answers

2

For me, the go-to source for answering the question "What's the state of the art for [task]?" is Papers with Code. They compile tasks, benchmarks, papers (and their associated official/unofficial implementations) into a series of leaderboards.

Consider the text-to-speech leaderboard, for example. At the top, you see a series of benchmarks that you can click on to see the top-performing papers.

Many popular leaderboards (like the one for TTS) also have a libraries section. These libraries may not be the state of the art, but they'll probably be the easiest to get up and running with.

There are some things to keep in mind, however.

  1. Take benchmarks with a grain of salt, especially for something like text-to-speech, where it's difficult to reduce performance to a single number.

  2. Research repositories can be very finicky. Make sure to check whether data and checkpoints are available. I'd recommend starting with existing, maintained libraries, but if you're looking for the state of the art, this might not be an option. You may be able to email the paper authors to request checkpoints or data if they aren't publicly available.

Alexander Wan
0

Check out Facebook's Massively Multilingual Speech (MMS) text-to-speech models. You can find them in the Hugging Face repo.
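A minimal sketch of how that might look with the `transformers` library, assuming a version that ships `VitsModel`, plus `torch`. The checkpoint name `facebook/mms-tts-eng` is the English MMS model on the Hub; the helper names `split_into_chunks` and `synthesize_to_wav` and the 200-character chunk size are my own choices for illustration, not part of any library. Splitting is there only because the question mentions a large amount of text, and feeding it sentence by sentence keeps each forward pass small.

```python
import re
import wave


def split_into_chunks(text, max_chars=200):
    """Group sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def synthesize_to_wav(text, out_path, model_id="facebook/mms-tts-eng"):
    """Synthesize text chunk by chunk and write one mono 16-bit WAV file."""
    # Heavy imports live here so split_into_chunks works without them.
    import torch
    from transformers import AutoTokenizer, VitsModel

    chunks = split_into_chunks(text)
    if not chunks:
        raise ValueError("text is empty")

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = VitsModel.from_pretrained(model_id)

    pieces = []
    for chunk in chunks:
        inputs = tokenizer(chunk, return_tensors="pt")
        with torch.no_grad():
            # waveform has shape (batch, samples); take the only batch item.
            pieces.append(model(**inputs).waveform[0])

    # Convert float audio in [-1, 1] to 16-bit PCM bytes.
    audio = torch.cat(pieces).clamp(-1.0, 1.0)
    pcm = (audio * 32767).to(torch.int16).numpy().tobytes()

    with wave.open(out_path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(model.config.sampling_rate)
        wav_file.writeframes(pcm)
```

Calling `synthesize_to_wav("Hello there. This is a quick test.", "mms_sample.wav")` downloads the checkpoint on first use and writes the audio locally. Check the license on the model card before using it for anything commercial.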

0

Honestly, nothing even comes close to the closed-source platforms. Since you said "Preferably open-source, but not mandatory", I'll recommend 11labs or HyperVoice. I use the 11labs UI and the HyperVoice API. Both are similar in quality, but HyperVoice wins by far in terms of cost.

0

Check out F5-TTS on Hugging Face. It's free. Give it a 10-second voice sample and use it on your text. Also, ChatGPT can take your text and punctuate it for pacing before you paste it into F5-TTS, which makes it sound much more natural and human.