Open-source vocal cloning (speech-to-speech neural style transfer)

Question

I want to program and train a voice cloner, partly to learn about this area of AI and partly to produce prototype audio for testing and gathering feedback from early adopters before recording in a studio with voice actors. For the prototype, I have a set of recordings from voice actors. I would like to record my own voice, in English or other languages, then run a neural network that produces audio with the same text, intonation and emotion but in roughly the actors' voices. It doesn't need to be perfect; 80% right and believable would be enough to get good feedback and reach a final version of the script before recording. I have 30 minutes to one hour of utterances from each voice I want to clone.

The closest I have found is Resemble.ai, which has an impressive video, but the public plan is English-only and other languages are prohibitively expensive. The engineer published a master's thesis as an open-source project, but that project only does text-to-speech, not speech-to-speech. Another startup is play.ht, but again it seems to be English-only.

This open-source project seems to do what I want, cloning Kate Winslet's voice, but it has no installation instructions, so I haven't tried it yet.

Can you recommend an open-source project, ideally in Python and TensorFlow, to roughly replace one voice with another?

Note: This question is similar to "What is the State-of-the-Art open source Voice Cloning tool right now?", except that that question is old and the project mentioned there only does text-to-speech, not speech-to-speech.

score 3 · Answer 1 · edited Nov 20 '23 at 10:35

TensorFlow code for "one-to-one" style transfer:

https://github.com/phiana/speech-style-transfer-vae-gan-tensorflow

It is the implementation of a 2021 paper. Useful search keywords are speech style transfer, voice cloning and speech-to-speech synthesis. A look at the state of the art turns up some related papers:

MIST-Tacotron: End-to-End Emotional Speech Synthesis Using Mel-Spectrogram Image Style Transfer (https://ieeexplore.ieee.org/document/9726166)
Expressive Neural Voice Cloning (https://expressivecloning.github.io/)

It seems to be a research area without much activity though. Maybe you could add something interesting :)

score 2 · Accepted Answer · edited Nov 15 '23 at 11:17

Additional projects that might be of interest:

Neural Voice Cloning with a Few Samples - NeurIPS 2018 (Sercan O. Arik, Jitong Chen, Kainan Peng, Wei Ping, Yanqi Zhou)

A neural voice cloning system is introduced, using a few audio samples to create personalized speech interfaces. Two approaches are explored: speaker adaptation, which fine-tunes a multi-speaker model with cloning samples, and speaker encoding, which trains a separate model to infer new speaker embeddings from cloning audios. Both methods achieve good performance in terms of speech naturalness and similarity to the original speaker. Although speaker adaptation offers better naturalness and similarity, speaker encoding demands less cloning time and memory, making it suitable for low-resource deployment.

Here is an open-source implementation of the paper, but according to its GitHub page the project has been archived and read-only since February 2021.
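To make the difference between the two approaches concrete, here is a toy NumPy sketch. It is not the paper's code: the frozen linear "model" `W`, the simulated `utterances`, and the ridge-regression encoder are all assumptions chosen so the example runs in a second. Speaker adaptation is shown in its lightest form (fitting only a new speaker embedding by gradient descent, with the model frozen), and speaker encoding as a separately trained model that predicts the embedding in one pass.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, FEAT_DIM, NOISE = 4, 16, 0.01

# Hypothetical stand-in for a trained multi-speaker model, kept frozen:
# acoustic features = W @ speaker_embedding.
W = rng.normal(size=(FEAT_DIM, EMB_DIM))

def utterances(embedding, n):
    """Simulate n utterances (feature vectors) from one speaker."""
    return embedding @ W.T + NOISE * rng.normal(size=(n, FEAT_DIM))

def adapt_speaker(cloning_feats, steps=2000):
    """Speaker adaptation: iteratively fit a new embedding to the cloning samples."""
    target = cloning_feats.mean(axis=0)
    lr = 0.5 / np.linalg.norm(W, 2) ** 2  # step size small enough for stable descent
    emb = np.zeros(EMB_DIM)
    for _ in range(steps):
        emb -= lr * 2 * W.T @ (W @ emb - target)  # gradient of the squared error
    return emb

def train_encoder(n_speakers=200, ridge=1.0):
    """Speaker encoding: fit a separate model (ridge regression here)
    that maps features to embeddings, using many training speakers."""
    embs = rng.normal(size=(n_speakers, EMB_DIM))
    feats = np.vstack([utterances(e, 1) for e in embs])
    gram = feats.T @ feats + ridge * np.eye(FEAT_DIM)
    return np.linalg.solve(gram, feats.T @ embs)  # shape (FEAT_DIM, EMB_DIM)

def encode_speaker(encoder, cloning_feats):
    """Infer the embedding in a single pass, with no per-speaker optimisation."""
    return cloning_feats.mean(axis=0) @ encoder

true_emb = rng.normal(size=EMB_DIM)  # an unseen speaker
cloning = utterances(true_emb, 5)    # "a few samples"
encoder = train_encoder()

emb_adapted = adapt_speaker(cloning)
emb_encoded = encode_speaker(encoder, cloning)
print("adaptation error:", np.linalg.norm(emb_adapted - true_emb))
print("encoding error:  ", np.linalg.norm(emb_encoded - true_emb))
```

The trade-off described in the abstract shows up even in this toy: adaptation runs thousands of update steps per new speaker, while encoding pays its cost once at training time and then needs only a single matrix product per speaker.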

Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning, arXiv:2109.11115 [cs.SD]

In this paper, the authors present a novel one-shot voice cloning algorithm called Unet-TTS that has good generalization ability for unseen speakers and styles. Based on a skip-connected U-net structure, the new model can efficiently discover speaker-level and utterance-level spectral feature details from the reference audio, enabling accurate inference of complex acoustic characteristics as well as imitation of speaking styles into the synthetic speech. According to both subjective and objective evaluations of similarity, the new model outperforms both speaker embedding and unsupervised style modeling (GST) approaches on an unseen emotional corpus.

The GitHub repo has a link to a Google Colab.
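For readers unfamiliar with the "skip-connected U-net structure" mentioned in the abstract, the following is a generic, stripped-down illustration of what skip connections do. It is not the Unet-TTS architecture: there are no learned layers, and `downsample`, `upsample` and `encoder_decoder` are hypothetical helpers operating on a fake 8-frame, 3-bin "spectrogram".

```python
import numpy as np

def downsample(x):
    """Halve the time resolution by averaging adjacent frames."""
    return 0.5 * (x[0::2] + x[1::2])

def upsample(x):
    """Double the time resolution by repeating each frame."""
    return np.repeat(x, 2, axis=0)

def encoder_decoder(x, use_skips, depth=2):
    """Downsample `depth` times, then upsample back to the input length.
    With use_skips=True, each decoder level is concatenated with the
    encoder features saved at the same resolution (the U-Net idea)."""
    skips, h = [], x
    for _ in range(depth):
        skips.append(h)
        h = downsample(h)
    for skip in reversed(skips):
        h = upsample(h)
        if use_skips:
            h = np.concatenate([h, skip], axis=1)
    return h

frames = np.arange(24, dtype=float).reshape(8, 3)  # 8 time frames, 3 bins
plain = encoder_decoder(frames, use_skips=False)
unet = encoder_decoder(frames, use_skips=True)

print(plain.shape, unet.shape)
```

Without skips, the output keeps only the coarse, averaged content, so neighbouring frames come out identical. With skips, the frame-level detail of the input is still available at the output alongside the coarse features, which is the general reason this structure is used when fine spectral detail from a reference needs to survive to the output.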

ElevenLabs.io

(Not open source, but has a free tier. Voice cloning becomes available in the Starter tier, starting at $5/month.)

ElevenLabs initially built new text-to-speech models which rely on high compression and context understanding to render human speech ultra-realistically. Their tools aim to provide the necessary quality for voicing news, newsletters, books and videos. They also offer a suite of tools for voice cloning and designing synthetic voices.

BeyondWords.io

(Not open source, but has a free tier and is a partner of the Open Voice Network, a non-profit industry association dedicated to making voice technology worthy of user trust, which operates as a directed fund of The Linux Foundation.)

Voice cloning is part of the enterprise plan with custom pricing and requires 2-8 hours of recorded utterances following their script. See an example of original and cloned voice in English on YouTube. Although it sources non-English voices from partners such as Google and Amazon, it does not seem to support voice cloning in languages other than English.

score 1 · Answer 3 · answered Jan 02 '25 at 12:54

Just stumbled upon your question and thought I'd share my experience with TikTokVoice.net. I was actually looking for something similar a while back.

While it's not open-source (I know, I know... that's what you asked for), I found it pretty handy for quick voice generation. The cool thing is you can just jump in and try it without signing up - saved me a bunch of time when I was prototyping some content.

What I like:

Super generous with the free stuff (seriously, they don't get stingy)
Those TikTok-style voices are pretty spot on
Dead simple to use (like, literally paste & play)

The not-so-great:

Not open source (obviously)
Can't tweak the technical stuff much
Limited to social media type voices

Just my 2 cents! Let me know if you try it - curious to hear what you think!

Edit: Btw, definitely check out the other open-source options people mentioned here. They're solid choices if you need more technical control.
