For questions about audio processing tasks in the context of artificial intelligence.
Questions tagged [audio-processing]
41 questions
9
votes
1 answer
Is it possible to clean up an audio recording of a lecture using some type of AI system?
Is it possible to clean up an audio recording of a lecture from a smartphone (i.e. remove the background noise) using some type of AI system?
Thibault Molleman
5
votes
1 answer
How can I find a specific word in an audio file?
I'm trying to train and use a neural network to detect a specific word in an audio file. The input to the neural network is an audio clip 2-3 seconds long, and the neural network must determine whether the input audio (the voice of a person)…
Ali.kavari76
3
votes
1 answer
Can I filter barking sounds on the television?
My dog goes bonkers every time the sound of a barking dog is heard on a television program. I never noticed this before but literally every movie or show with an outdoors setting eventually includes the sound of a barking dog.
Is it possible to…
AlanD
3
votes
1 answer
What are the differences between RVC and SO-VITS-SVC models?
I'm trying to decide which one to use for my project, but I can't find specific differences or comparisons of the models anywhere.
Hiperfly
3
votes
2 answers
Can AI be used to reverse engineer a black box?
A while back I posted on the Reverse Engineering site about an audio DSP system whose designer had passed away and whose manufacturer no longer had source code (but the question was deleted). Basically, the audio filter settings are passed from a…
chmedly
2
votes
1 answer
Difficulty understanding Keras LSTM fitting data
I'm trying to train an RNN with a chunk of audio data, where X and Y are two audio channels loaded into NumPy arrays. The objective is to experiment with different NN designs to train them to transform single-channel (mono) audio into a two-channel…
Dmitry
2
votes
0 answers
How to prepare audio data for deep learning?
Audio data is typically an array with the waveform represented by values from -1 to 1. There are two issues with that:
if all values are inverted, e.g. -1 becomes 1 and 1 becomes -1, the audio doesn't change. But if for example I need to find…
Ford F150 Gaming
2
votes
0 answers
Understanding gumbel-softmax backpropagation in Wav2Vec papers
I'm studying the series of Wav2Vec papers, in particular vq-wav2vec and wav2vec 2.0, and I have trouble understanding some details of the quantization procedure.
The broader context is this: they use raw audio and first convert it to…
Peter Franek
2
votes
2 answers
Is it realistic to train a transformer-based model (e.g. GPT) in a self-supervised way directly on the Mel spectrogram?
In music information retrieval, one usually converts an audio signal into some kind of "sequence of frequency vectors", such as an STFT or a Mel spectrogram.
I'm wondering if it is a good idea to use the transformer architecture in a self-supervised manner…
Peter Franek
2
votes
0 answers
Model for direct audio-to-audio speech re-encoding
There are many resources available for text-to-audio (or vice versa) synthesis, for example Google's 'Wavenet'.
These tools do not allow the finer degree of control that may be required regarding the degree of inflections / tonality retained in…
NeverWasMyRealName
2
votes
1 answer
I want to determine how similar a given song is to Queen's songs. Am I headed in the right direction?
I've asked this question before (@ Reddit) and people suggested CNNs on a mel spectrogram more than anything else. This is great.
But I'm sort of stuck at: label some music data as "queen" and "not queen" and have this be the training set. Like,…
Mike Johnson Jr
2
votes
1 answer
How to get more accuracy of the logistic regression model?
I am working on a Baby Crying Detection model using logistic regression.
Out of $581$ audio clips, $222$ are of a baby crying. Each clip is $5$ seconds long.
What I have done is convert each audio clip into numbers, and those numbers go into a .csv file. So…
Muhammad Waqar Anwar
2
votes
0 answers
Is there an AI that can complete Deezer Spleeter work?
I have used Deezer Spleeter, but it produces echoes alongside the stems, so I wonder if there is already an AI that removes echo noise.
Mohammed Mehdi TBER
2
votes
0 answers
How do I train a multiple-speaker model (speech synthesis) based on Tacotron 2 and espnet?
I'm new to Speech Synthesis & Deep Learning. Recently, I got a task as described below:
I have a problem training a multi-speaker model, which should be created with Tacotron2. I was told I can get some ideas from espnet, which is an end-to-end…
Envelo Lee
2
votes
0 answers
State of the art in voice recognition
In the media there's a lot of talk about face recognition, mainly with respect to identifying faces (= assigning to persons). Less attention is paid to the recognition of facially expressed emotions, but there's a lot of research done into this…
Hans-Peter Stricker