For questions about the synthesis of speech, not to be confused with synthesizing text, formal-language expressions, or expressions in context-free grammars. Speech in this context is a sequence of audio samples, a sequence of spectral representations in the frequency domain, or a sequence of phonetic symbols that represent natural speech.
Questions tagged [speech-synthesis]
17 questions
                    
4 votes, 3 answers
            
        Open-source vocal cloning (speech-to-speech neural style transfer)
I want to program and train a voice cloner, in part to learn about this area of AI, and in part to use as a prototype of audio for testing and getting feedback from early adopters before recording in a studio with voice actors. For the prototype, I…
         
    
    
        ginjaemocoes
        
- 37
- 1
- 9
3 votes, 0 answers
            
        Can computers recognise "grouping" from voice tonality?
In human communication, tonality or tonal language conveys a great deal of complex information, including emotions and motives. But setting such complex aspects aside, tonality also serves the very basic purpose of "grouping" or "taking common" functions, such as:
The…
        user27217
3 votes, 2 answers
            
        What is the difference between automatic transcription and automatic speech recognition?
What is the difference between automatic transcription and automatic speech recognition? Are they the same?
Is my following interpretation correct?
Automatic transcription: it converts the speech to text by looking at the whole spoken input…
         
    
    
        Murugesh
        
- 141
- 2
2 votes, 1 answer
            
        How to measure the similarity of the pronunciation of two words?
I would like to know how I could measure the similarity of the pronunciation of two words. These two words are quite similar and differ only in one vowel.
I know there is, e.g., the Hamming distance or the Levenshtein distance but they measure the "general"…
         
    
    
        Ben
        
- 205
- 1
- 8
2 votes, 0 answers
            
        Model for direct audio-to-audio speech re-encoding
There are many resources available for text-to-audio (or vice versa) synthesis, for example Google's WaveNet.
These tools do not allow the finer degree of control that may be required regarding the degree of inflections / tonality retained in…
         
    
    
        NeverWasMyRealName
        
- 21
- 1
2 votes, 0 answers
            
        How do I train a multiple-speaker model (speech synthesis) based on Tacotron 2 and espnet?
I'm new to Speech Synthesis & Deep Learning. Recently, I got a task as described below:
I have a problem training a multi-speaker model, which should be built with Tacotron 2. I was told I can get some ideas from espnet, which is an end-to-end…
         
    
    
        Envelo Lee
        
- 21
- 1
2 votes, 0 answers
            
        What is the State-of-the-Art open source Voice Cloning tool right now?
I would like to clone a voice as precisely as possible. Lately, impressive models have been released that only need about 10 s of voice input (cf. https://github.com/CorentinJ/Real-Time-Voice-Cloning), but I would like to go beyond that and clone a…
         
    
    
        Remind
        
- 21
- 1
1 vote, 0 answers
            
        How to achieve Voice Conversion Using Voice Samples of a Specific Person using any voice as input?
I'm working on a project involving voice conversion, aiming to transform a voice to sound like a specific person speaking Darija (a Moroccan Arabic dialect). I have collected a set of voice samples from the target person and prepared them in a…
         
    
    
        anasse
        
- 11
- 1
1 vote, 4 answers
            
        What is the best Text-to-speech model available open-source?
I tried a couple of different websites and libraries. Also found this topic from 3.5 years ago - What are the current open source text-to-audio libraries?
It looks like nobody published anything in the last couple of years and most solutions are…
         
    
    
        Yevhen Salitrynskyi
        
- 27
- 1
- 2
1 vote, 0 answers
            
        Is speech-to-speech conversion that changes the voice to a given other voice possible?
Background:
I am working on a research project to use (demonstrate) the possibilities of Machine Learning and AI in artistic projects. One thing we are exploring is demonstrating deep fakes on stage. Of course, a deep fake is not easy to make.…
         
    
    
        Nathan
        
- 143
- 4
1 vote, 0 answers
            
        How many spectrogram frames per input character does the text-to-speech (TTS) system Tacotron-2 generate?
I've been reading about Tacotron-2, a text-to-speech system that generates human-like speech (indistinguishable from humans), using the GitHub repository https://github.com/Rayhane-mamah/Tacotron-2.
I'm very confused about a simple aspect of text-to-speech…
         
    
    
        Joe Black
        
- 181
- 2
- 6
1 vote, 0 answers
            
        Can't figure out what's going wrong with my dataset construction for multivariate regression
TL;DR: I can't figure out why my neural network won't give me a sensible output. I assume it has something to do with how I'm presenting the input data to it, but I have no idea how to fix it.
Background:
I am using matched pairs of speech samples to…
         
    
    
        NotQuiteHere
        
- 19
- 1
1 vote, 0 answers
            
        Improving the performance of a DNN model
I have been running the open-source text-to-speech system Ossian. It uses feed-forward DNNs for its acoustic modeling. The error graph I got after running the acoustic model looks like this:
Here is some relevant information:
Size of Data: 7…
         
    
    
        Arif Ahmad
        
- 111
- 1
0 votes, 1 answer
            
        Adding voices to voice synthesis corpuses
If one uses one of the open-source implementations of the WaveNet generative speech synthesis design, such as https://r9y9.github.io/wavenet_vocoder/, and trains it using something like the CMU ARCTIC corpus, how can one add a voice that sounds…
         
    
    
        Douglas Daseeco
        
- 7,543
- 1
- 28
- 63
0 votes, 0 answers
            
        How to resize the time-frequency spectrum of a 1D signal so that an image classification model can be used?
1D signals generally have thousands of sample points per trial, especially biomedical signals. The time-frequency spectrum can then have a shape like [256,1000] or [50,1000], etc.
However, most popular image classification…
        
    