Has there been research on processing speech and then building a "speaker profile" from the processed speech? Examples of building the profile would be matching the voice to a speaker profile, and matching speech patterns and word choice to it. Basically, building a model of an individual based solely on speech. Any examples of this being implemented would be greatly appreciated.
3 Answers
Yes, there is. A quick search found this: Multimodal Speaker Identification Based on Text and Speech (2008).
In the abstract, they write
This paper proposes a novel method for speaker identification based on both speech utterances and their transcribed text. The transcribed text of each speaker’s utterance is processed by the probabilistic latent semantic indexing (PLSI) that offers a powerful means to model each speaker’s vocabulary employing a number of hidden topics, which are closely related to his/her identity, function, or expertise. Mel-frequency cepstral coefficients (MFCCs) are extracted from each speech frame and their dynamic range is quantized to a number of predefined bins in order to compute MFCC local histograms for each speech utterance, which is time-aligned with the transcribed text. Two identity scores are independently computed by the PLSI applied to the text and the nearest neighbor classifier applied to the local MFCC histograms. It is demonstrated that a convex combination of the two scores is more accurate than the individual scores on speaker identification experiments conducted on broadcast news of the RT-03 MDE Training Data Text and Annotations corpus distributed by the Linguistic Data Consortium.
Under figure 2, they write
Identification rate versus Probe ID when 44 speakers are employed. Average identification rates for (a) PLSI: 69%; (b) MFCCs: 66%; (c) Both: 67%.
In section 4, they write
To demonstrate the proposed multimodal speaker identification algorithm, experiments are conducted on broadcast news (BN) collected within the DARPA Efficient, Affordable, Reusable Speech-to-Text (EARS) Program in Metadata Extraction (MDE).
If you need more related papers, you could use a tool like https://the.iris.ai/ to find them.
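To make the abstract's pipeline more concrete, here is a minimal sketch in Python of the audio side (quantized feature histograms plus a nearest-neighbor score) and the convex combination with a text score. It is not the authors' implementation: the function names, the single shared histogram over all coefficients, the L1 distance, and the toy numbers are all assumptions made for illustration. The text scores are simply given as inputs here; in the paper they come from PLSI.

```python
from collections import Counter


def local_histogram(frames, n_bins, lo, hi):
    """Quantize every coefficient into n_bins over [lo, hi] and return a
    normalized histogram. Assumption: one histogram shared by all coefficients."""
    width = (hi - lo) / n_bins
    counts = Counter()
    total = 0
    for frame in frames:
        for value in frame:
            idx = int((value - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values
            counts[idx] += 1
            total += 1
    return [counts[b] / total for b in range(n_bins)]


def nn_scores(probe_hist, enrolled_hists):
    """Nearest-neighbor style score per speaker: 1 minus half the L1 distance,
    which lies in [0, 1] for normalized histograms (1 means identical)."""
    return {
        speaker: 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(probe_hist, hist))
        for speaker, hist in enrolled_hists.items()
    }


def fuse(text_scores, audio_scores, alpha):
    """Convex combination: alpha * text + (1 - alpha) * audio, 0 <= alpha <= 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return {
        speaker: alpha * text_scores[speaker] + (1.0 - alpha) * audio_scores[speaker]
        for speaker in text_scores
    }


def identify_speaker(scores):
    """Return the speaker with the highest fused score."""
    return max(scores, key=scores.get)


# Toy data (made up): two enrolled speakers and one probe utterance.
enrolled = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
probe = local_histogram([[0.1, 0.2], [0.3, 0.4]], n_bins=2, lo=0.0, hi=1.0)
audio = nn_scores(probe, enrolled)
text = {"alice": 0.4, "bob": 0.6}  # hypothetical text-based scores
print(identify_speaker(fuse(text, audio, alpha=0.3)))
```

The weight `alpha` controls how much the text score counts relative to the audio score; in practice it would be tuned on held-out data.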
Speaker identification is a widely researched domain. The modern approach is to map speaker information to an i-vector, a real-valued vector of 200-400 components that summarizes the speaker's characteristics. i-vectors allow very precise speaker identification and verification.
For more information, you can check the i-vector tutorial.
You can also see the state of the art in the results of the NIST i-vector challenge.
For an implementation, you can check the following speaker recognition experiment from Kaldi.
For best accuracy, i-vectors are extracted with DNN-based UBMs; note that GMM-based UBMs are less accurate.
For more in-depth information about speaker recognition methods and algorithms, check this textbook.
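Once each utterance is reduced to a fixed-length vector, comparing speakers becomes a vector comparison. A common baseline for i-vectors is cosine scoring. The sketch below assumes the vectors have already been extracted by a toolkit such as Kaldi; the function names, the averaging used for enrollment, the threshold, and the 3-dimensional toy vectors are made up for illustration (real i-vectors have hundreds of components).

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot score a zero vector")
    return dot / (norm_a * norm_b)


def enroll(vectors):
    """Build a speaker profile by averaging several vectors of one speaker."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def verify(profile, test_vector, threshold):
    """Verification: accept the claimed identity if the score reaches the threshold."""
    return cosine_score(profile, test_vector) >= threshold


def best_match(profiles, test_vector):
    """Identification: return the enrolled speaker with the highest score."""
    return max(profiles, key=lambda name: cosine_score(profiles[name], test_vector))


# Toy vectors standing in for i-vectors (hypothetical values).
profiles = {
    "alice": enroll([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]),
    "bob": enroll([[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]]),
}
test_vector = [0.9, 0.1, 0.0]
print(best_match(profiles, test_vector))
print(verify(profiles["bob"], test_vector, threshold=0.5))
```

The threshold in `verify` is a placeholder; it would normally be chosen on a development set to balance false accepts against false rejects.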
 
    
DeepMind recently created a voice synthesiser along those lines. It seems to be incredibly slow, but it might be possible to create a dumbed-down version of it.
Apparently the task is called parametric TTS (text to speech). This overview might give you some leads.
 
    
 
     
    