Identifying speech sounds from sound waves

Question

TLDR: How do we differentiate, say, an "A" from an "O"? How do we identify speech sounds? If formants are the key, how is it possible to identify them regardless of the pitch (fundamental frequency) and regardless of the voice of the person speaking/singing (timbre)? See 1) and 2) below for more specific questions.

I have a question that I thought was fairly basic and turns out to be more complex than expected, for someone who doesn’t have specific knowledge of the field.

Sounds are acoustic waves that we can analytically characterize/visualize with a variety of different concepts. I want to understand how those analytical tools relate to the intuitive understanding of a sound that we have (i.e. how would one read a spectrogram/amplitude envelope of a sound and “predict” exactly what is heard). Specifically, how do we differentiate $\textbf{speech sounds}$?

For reference, here is my qualitative understanding of some of the most notable features of a general sound (not only speech). I mostly focus on sounds with a well-defined fundamental frequency (an instrument or a person singing a note):

Pitch: For a harmonic sound like the pluck of a guitar string, you extract the fundamental frequency from your Fourier analysis and that’s the perceived pitch. (In the case of a non-harmonic sound, i.e. one with no distinct spectral maxima, I guess you can say there’s no well-defined pitch, i.e. you’re just not singing/playing a note.)
Loudness: amplitude of the sound wave.
How “explosive” the sound feels: time-dependent amplitude envelope (e.g. a sharp amplitude peak at the attack of the sound).
Timbre (by that I mostly mean what allows us to identify what is making the sound: piano? guitar? human?): More complex. But as I understand it, the most decisive factors are the distribution of the harmonics in the spectrum (i.e. different instruments have different Fourier coefficients for the same fundamental frequency) and again the time envelope (which echoes the previous point, since it quite clearly comes into play when identifying an instrument). Timbre also allows us to differentiate voices. Thus I would suppose that a timbre is “characteristic” of a person, i.e. John has his own identifiable timbre that is unchanging (unless he’s mimicking Martha, but that’s kinda rude).
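To make the "Pitch" and "Timbre" points concrete, here is a minimal sketch of what I have in mind, assuming NumPy. It builds a synthetic harmonic tone and reads the fundamental off the Fourier spectrum as the lowest strong component. The function names, the 10% threshold and the harmonic amplitudes are my own illustrative choices, and "pitch = lowest strong spectral peak" is a simplification that I assume only holds for clean harmonic sounds like this one.

```python
import numpy as np

def synth_harmonic_tone(f0, amplitudes, sample_rate=8000, duration=1.0):
    """Sum of sine harmonics k*f0; amplitudes[k-1] is the weight of harmonic k."""
    t = np.arange(int(sample_rate * duration)) / sample_rate
    tone = np.zeros_like(t)
    for k, a in enumerate(amplitudes, start=1):
        tone += a * np.sin(2 * np.pi * k * f0 * t)
    return tone

def estimate_fundamental(signal, sample_rate=8000, rel_threshold=0.1):
    """Return the lowest frequency whose magnitude exceeds a fraction of the peak.

    Assumption: the sound is cleanly harmonic and the fundamental is present.
    """
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    strong = np.nonzero(spectrum > rel_threshold * spectrum.max())[0]
    return freqs[strong[0]]

# The second harmonic is the loudest here (a "timbre" choice),
# but the lowest strong component is still the 220 Hz fundamental.
tone = synth_harmonic_tone(220.0, [0.5, 1.0, 0.3])
estimated_f0 = estimate_fundamental(tone)
print(f"estimated fundamental: {estimated_f0:.1f} Hz")
```

Changing the list of amplitudes changes the Fourier coefficients (what I call timbre above) without moving the estimated fundamental.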

Now the last point, which I’m very curious about but fail to really understand, is speech sounds. How can I so accurately differentiate an “AAAA” sound from an “OOOO” sound, regardless of the person speaking (voice timbre) and the pitch used (more precisely, the fundamental frequency)? To satisfy this, I imagined (wrongly) that the characteristic of the sound wave related to the specific speech sound must not be correlated to either frequency or timbre. However, doing some research, I came across the notion of $\textbf{formants}$ (https://en.wikipedia.org/wiki/Formant and https://home.cc.umanitoba.ca/~krussll/phonetics/acoustic/spectrogram-sounds.html), which are broad spectral maxima. It seems that the frequencies of those peaks are characteristic of, say, a given vowel.

I have 2 problems with that. First, I don’t understand how we can vocalize an “A” sound at different pitches and still keep it absolutely identifiable and very distinct from any other vowel. For example, it is said that for the average man, the first formant for E is 390 Hz and for A is 850 Hz. But I can very well sing an “E” sound at a pitch that gives me a fundamental frequency of 850 Hz, can’t I? Then how are we still able to clearly tell the difference? Even if there are other formants, I feel this should at least confuse the ear, no? Those relative positions of the first formants of two different vowels seem to imply that "A" should sound higher, but of course that is not necessarily the case...

Second problem: Having multiple formants with a specific frequency difference/ratio seems to impose a particular resonance pattern. Said otherwise, it seems to me that having specific relative frequencies for the formants implies having a specific set of Fourier coefficients. And that, to me, was mostly what defined timbre (our ability to differentiate voices). So how come different voices can have the same “formant pattern” that we identify? By shaping our mouth cavity, I imagine we can produce different resonance patterns, but then it seems to me that producing different formants would require us to alter what defines our identifiable voice. Basically, what exactly is the difference between timbre and the vocalization of vowels? I am confused.
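To make this second problem concrete, here is a toy sketch of how I picture a formant, assuming NumPy. It treats a single formant as a fixed broad peak in the spectrum and reads off the amplitudes of the harmonics of a given fundamental under that peak. The 850 Hz centre is the first-formant value for A quoted above; the peak shape, the 100 Hz bandwidth and the function names are purely hypothetical choices of mine, not an established model.

```python
import numpy as np

def formant_envelope(freq_hz, formant_hz=850.0, bandwidth_hz=100.0):
    """Toy broad spectral peak centred on formant_hz (shape and width are assumptions)."""
    return 1.0 / (1.0 + ((freq_hz - formant_hz) / bandwidth_hz) ** 2)

def harmonic_amplitudes(f0, n_harmonics=20, formant_hz=850.0):
    """Amplitudes of harmonics k*f0 when weighted by the toy formant peak."""
    harmonic_freqs = f0 * np.arange(1, n_harmonics + 1)
    return harmonic_freqs, formant_envelope(harmonic_freqs, formant_hz)

# Same toy formant, two different sung pitches.
for f0 in (120.0, 200.0):
    freqs, amps = harmonic_amplitudes(f0)
    strongest = int(np.argmax(amps))
    print(f"f0={f0:.0f} Hz: strongest harmonic is number {strongest + 1} "
          f"at {freqs[strongest]:.0f} Hz")
```

In this toy picture the strongest harmonic is number 7 (840 Hz) at a 120 Hz pitch and number 4 (800 Hz) at a 200 Hz pitch, so the set of Fourier coefficients changes with pitch even though the peak itself stays near 850 Hz. Whether this is the right way to think about real voices is exactly what I am asking.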

So, two questions to summarize:

1) Do the relative positions of the first formants imply that one vowel is intrinsically higher than another? How do we reconcile that with our ability to sing vowels very low or very high?
2) The formant structure of a speech sound is akin to its Fourier decomposition, in my understanding. However, this is also what determines timbre, so what is the difference between the two, i.e. how can we determine both the timbre and the nature of the speech sound independently?

Sorry for the long post! As it is a very broad question I wanted to try and be clear on what I understand/what I don’t
