SpeechT5 Huggingface voice conversion: how to process whole input

Asked Sep 25 '24 at 13:08

Active Sep 25 '24 at 18:28

Viewed 71 times

I followed the voice conversion example in the Hugging Face blog post and can reproduce it in a Colab session. As mentioned in the blog, the conversion consistently stops in the middle of the example input sentence, because the model interprets the pause as the end of the sentence.

My end goal is to use the SpeechT5 model to replicate the product ElevenLabs offers. There you can upload a sound file of up to 50 MB and expect a voice conversion whose duration matches the whole input.

I have started writing a Python script that analyzes the input audio clip, detects silence, cuts out the non-silent chunks, and sends these chunks to the converter one by one. Finally, the converted chunks need to be "glued" back together with all-zero silence chunks and written to a new WAV file.
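A minimal sketch of that chunking idea, under stated assumptions: the audio is already loaded as a mono float array in the range [-1, 1] at 16 kHz, silence is detected with a simple per-frame RMS threshold, and `convert_chunk` is a hypothetical placeholder for whatever function wraps the SpeechT5 speech-to-speech call. The function names, the threshold, and the frame sizes are illustrative and untuned.

```python
import numpy as np


def _runs(flags):
    """Return (start, end, value) for each run of equal values in a 1-D bool array."""
    if len(flags) == 0:
        return []
    change = np.flatnonzero(np.diff(flags.astype(np.int8))) + 1
    bounds = np.concatenate(([0], change, [len(flags)]))
    return [(int(s), int(e), bool(flags[s])) for s, e in zip(bounds[:-1], bounds[1:])]


def split_on_silence(samples, sample_rate=16000, frame_ms=20,
                     threshold=0.01, min_silence_ms=300):
    """Split audio into (start, end, is_speech) sample ranges covering the whole input.

    Assumes float samples in [-1, 1]; for 16-bit PCM, divide by 32768 first.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = int(np.ceil(len(samples) / frame_len))

    # Zero-pad to a whole number of frames, then compute the RMS of each frame.
    padded = np.zeros(n_frames * frame_len, dtype=np.float32)
    padded[:len(samples)] = samples
    rms = np.sqrt(np.mean(padded.reshape(n_frames, frame_len) ** 2, axis=1))
    voiced = rms > threshold

    # Treat short pauses between speech as speech, so words are not cut apart.
    min_frames = max(1, min_silence_ms // frame_ms)
    for s, e, is_voiced in _runs(voiced):
        if not is_voiced and e - s < min_frames and s > 0 and e < n_frames:
            voiced[s:e] = True

    # Map frame runs back to sample indices, clipping the padded tail.
    return [(s * frame_len, min(e * frame_len, len(samples)), v)
            for s, e, v in _runs(voiced)]


def convert_whole_file(samples, convert_chunk, **split_kwargs):
    """Convert speech chunks one by one and glue them back together with silence.

    convert_chunk is a hypothetical callable: float array in, float array out.
    """
    pieces = []
    for start, end, is_speech in split_on_silence(samples, **split_kwargs):
        if is_speech:
            pieces.append(np.asarray(convert_chunk(samples[start:end]), dtype=np.float32))
        else:
            pieces.append(np.zeros(end - start, dtype=np.float32))
    return np.concatenate(pieces) if pieces else np.zeros(0, dtype=np.float32)
```

Note that the output only has the same duration as the input if `convert_chunk` returns chunks of the same length it receives. If the converted speech is shorter or longer, the silence gaps stay aligned in order but the total length drifts.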

Is this the only viable way to make the SpeechT5 model process a whole input file of, say, 10 minutes of 16-bit, 16 kHz mono WAV audio?

edited Sep 25 '24 at 18:28

G. Debailly

0 Answers