I would like to feed a sound signal of about 2-3 seconds to my neural network. I have trained the network on a single word: for example, if I say "Hello", the network should tell whether "Hello" was spoken or not, and if some other word like "World" is spoken, it should say that "Hello" was not spoken. I just want to classify whether a sound is a specific command or word. What is the best way to do this? I am not that advanced in DNNs; I only know about NNs and CNNs. Is there a research paper or tutorial on this, or could someone explain how it is done?
1 Answer
If your speech data has a fixed length, you can detect the content using only a CNN. You can treat the problem as binary classification: 1 if the spoken word is the target word, 0 otherwise.
But first, you need to make the input length fixed. For example, use 2 seconds as the fixed length: if a recording is longer than 2 seconds, crop it; if it is shorter, pad it with zeros. A minimal sketch is shown below.
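For illustration, here is a minimal sketch of the crop/pad step, assuming NumPy, a 16 kHz sample rate, and a 2-second target length (the function name and defaults are just placeholders, not from any particular library):

```python
import numpy as np

def fix_length(signal, sr=16000, duration=2.0):
    """Crop or zero-pad a 1-D audio signal to a fixed duration."""
    target_len = int(sr * duration)
    if len(signal) >= target_len:
        return signal[:target_len]       # crop to the first 2 seconds
    pad = target_len - len(signal)
    return np.pad(signal, (0, pad))      # zero-pad at the end
```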
Next, you can either use the raw data (time domain) or transform it with a feature extraction method such as the FFT, MFCCs, or MFSCs. Then use a CNN the same way you would use it to classify an image: the resulting time-frequency representation of the sound can be treated as a 2D image. A sketch of this step follows below.
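As a rough sketch of this idea, assuming librosa for MFCC extraction and Keras for the CNN (the layer sizes and the 63-frame width are placeholder choices; the actual number of frames depends on your hop length):

```python
import librosa
import numpy as np
from tensorflow import keras

def mfcc_features(signal, sr=16000, n_mfcc=40):
    """Turn a fixed-length waveform into an (n_mfcc, frames, 1) 'image'."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc[..., np.newaxis]

def build_model(input_shape):
    """Small binary CNN: output near 1 = target word, near 0 = anything else."""
    return keras.Sequential([
        keras.layers.Conv2D(16, (3, 3), activation="relu", input_shape=input_shape),
        keras.layers.MaxPooling2D((2, 2)),
        keras.layers.Conv2D(32, (3, 3), activation="relu"),
        keras.layers.MaxPooling2D((2, 2)),
        keras.layers.Flatten(),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])

# Example usage (2 s at 16 kHz with the default hop gives ~63 MFCC frames):
# model = build_model(input_shape=(40, 63, 1))
# model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```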
But if your data has varying lengths, you can instead combine a CNN (to detect each phoneme or local pattern) with an RNN or HMM that models the sequence. You can also read about this method in the mentioned papers. A rough sketch of such a hybrid model is given below.
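A very rough sketch of such a convolutional-recurrent hybrid, again assuming Keras (the architecture here is only an illustration of the idea, not a reference implementation from any paper):

```python
from tensorflow import keras

def build_crnn(n_mfcc=40):
    """Variable-length input: local CNN features summarized by an LSTM."""
    return keras.Sequential([
        keras.layers.Input(shape=(None, n_mfcc)),                # any number of frames
        keras.layers.Conv1D(32, 5, padding="same", activation="relu"),
        keras.layers.MaxPooling1D(2),
        keras.layers.LSTM(64),                                   # pools the whole sequence
        keras.layers.Dense(1, activation="sigmoid"),             # 1 = target word
    ])
```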