So far, the usual way to recreate the sounds of these musical instruments has been physical modelling synthesis. But what about machine learning methods?
1 Answer
DeepMind's WaveNet, a generative model initially developed for speech synthesis that operates directly on raw audio waveforms, has been adapted for generating realistic audio, including musical instrument sounds.
WaveNet's ability to generate raw waveforms means that it can model any kind of audio, including music... WaveNet is a type of feedforward neural network known as a deep convolutional neural network (CNN). In WaveNet, the CNN takes a raw signal as an input and synthesises an output one sample at a time... WaveNet greatly reduced the number of audio recordings that were required to create a voice model by modeling the raw audio of the voice actor samples.
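To make that concrete, here is a minimal sketch (in PyTorch) of the autoregressive idea described above: a stack of dilated causal 1-D convolutions predicts a distribution over the next audio sample given everything generated so far. The layer sizes, the 256 quantization levels, and the naive sampling loop are illustrative choices of mine, not the published WaveNet architecture.

```python
import torch
import torch.nn as nn

class TinyWaveNet(nn.Module):
    def __init__(self, channels=32, levels=256, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.embed = nn.Conv1d(1, channels, kernel_size=1)
        # Causal convolutions: pad only on the left so sample t never sees sample t+1.
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=d) for d in dilations
        )
        self.out = nn.Conv1d(channels, levels, kernel_size=1)

    def forward(self, x):                                # x: (batch, 1, time), values in [-1, 1]
        h = self.embed(x)
        for conv in self.layers:
            pad = conv.dilation[0] * (conv.kernel_size[0] - 1)
            h = torch.relu(conv(nn.functional.pad(h, (pad, 0)))) + h   # residual connection
        return self.out(h)                               # logits over `levels` values per time step

# Naive sample-by-sample generation: feed the growing waveform back into the model.
model = TinyWaveNet()
wave = torch.zeros(1, 1, 1)                              # start from silence
for _ in range(200):
    logits = model(wave)[:, :, -1]                       # distribution for the *next* sample
    nxt = torch.multinomial(torch.softmax(logits, dim=-1), 1)
    wave = torch.cat([wave, (nxt.float() / 255.0 * 2 - 1).view(1, 1, 1)], dim=2)
```

Real WaveNet implementations add gated activations, skip connections, and caching for fast generation; the sketch only shows why synthesis is inherently one sample at a time.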
You can further refer to Engel et al. (2017), "Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders", for WaveNet's application to musical instrument sound synthesis specifically.
First, we detail a powerful new WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform. Second, we introduce NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets. Using NSynth, we demonstrate improved qualitative and quantitative performance of the WaveNet autoencoder over a well-tuned spectral autoencoder baseline. Finally, we show that the model learns a manifold of embeddings that allows for morphing between instruments, meaningfully interpolating in timbre to create new types of sounds that are realistic and expressive.
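The "morphing between instruments" in that abstract amounts to interpolating between the learned embeddings of two notes and decoding the result. A hedged sketch of the idea follows; `encoder` and `wavenet_decoder` are placeholders standing in for the paper's WaveNet autoencoder, not a real API.

```python
import torch

def morph(note_a, note_b, encoder, wavenet_decoder, alpha=0.5):
    """Interpolate in embedding space to get a timbre between two instruments."""
    z_a = encoder(note_a)                      # (time_frames, embedding_dim) temporal code
    z_b = encoder(note_b)
    z_mix = (1 - alpha) * z_a + alpha * z_b    # linear interpolation in the learned manifold
    # The autoregressive decoder is conditioned on the interpolated code
    # while generating the output waveform sample by sample.
    return wavenet_decoder(z_mix)
```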
There is also GANSynth (Google), which builds on the NSynth dataset and uses GANs to model audio, generating instrument sounds with better temporal coherence. See Engel et al. (2019), "GANSynth: Adversarial Neural Audio Synthesis".
Rather than containing all types of audio, NSynth consists solely of individual notes from musical instruments across a range of pitches, timbres, and volumes. Similar to CelebA, all the data is aligned and cropped to reduce variance and focus on fine-scale details, which in audio corresponds to timbre and fidelity. Further, each note is also accompanied by an array of attribute labels to enable exploring conditional generation.
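To contrast with WaveNet's sample-by-sample loop, here is a rough, hypothetical sketch of the GAN setup: a generator maps a latent vector plus a one-hot pitch label to an entire time-frequency representation in a single forward pass, and a discriminator scores real versus generated notes. The layer sizes and the (frames, bins) output shape are illustrative; GANSynth itself is a progressive GAN that produces log-magnitude/instantaneous-frequency spectrograms.

```python
import torch
import torch.nn as nn

LATENT, N_PITCHES, FRAMES, BINS = 64, 61, 128, 128   # illustrative sizes

generator = nn.Sequential(
    nn.Linear(LATENT + N_PITCHES, 256), nn.ReLU(),
    nn.Linear(256, FRAMES * BINS), nn.Tanh(),
)
discriminator = nn.Sequential(
    nn.Linear(FRAMES * BINS + N_PITCHES, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
)

def generate_note(pitch_index):
    """Generate one (FRAMES, BINS) spectrogram for a given pitch, in one forward pass."""
    z = torch.randn(1, LATENT)
    pitch = torch.zeros(1, N_PITCHES)
    pitch[0, pitch_index] = 1.0                        # one-hot pitch conditioning
    spec = generator(torch.cat([z, pitch], dim=1))
    return spec.view(FRAMES, BINS), pitch              # invert to audio with e.g. Griffin-Lim

spec, pitch = generate_note(pitch_index=40)
# During training the discriminator scores (spectrogram, pitch) pairs as real or fake.
realness = discriminator(torch.cat([spec.view(1, -1), pitch], dim=1))
```

Because the whole note is produced at once rather than autoregressively, this style of model keeps phase and pitch coherent across the note and synthesizes far faster than sample-level models.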
Finally, OpenAI's Jukebox can also generate instrument sounds; it compresses raw audio with a hierarchical (multi-scale) VQ-VAE, in the spirit of VQ-VAE-2, and models the resulting discrete codes autoregressively. See Dhariwal et al. (2020), "Jukebox: A Generative Model for Music".
Hierarchical VQ-VAEs can generate short instrumental pieces from a few sets of instruments, however they suffer from hierarchy collapse due to use of successive encoders coupled with autoregressive decoders. A simplified variant called VQ-VAE-2 avoids these issues by using feedforward encoders and decoders only, and they show impressive results at generating high-fidelity images... Our first raw audio model, which learns to recreate instruments like Piano and Violin. We try a dataset of rock and pop songs, and surprisingly it works.
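The step that makes this tractable is vector quantization: the encoder's continuous output is snapped to the nearest entry of a learned codebook, and the autoregressive prior is then trained over those discrete codes rather than over raw samples. A minimal sketch of that quantization step (codebook size and dimensions are made up for illustration):

```python
import torch

codebook = torch.randn(512, 64)          # 512 discrete codes, 64 dimensions each

def quantize(z):
    """z: (time_steps, 64) continuous encoder output for a chunk of audio."""
    dists = torch.cdist(z, codebook)     # distance to every codebook entry, (time_steps, 512)
    codes = dists.argmin(dim=1)          # one discrete token per time step
    return codes, codebook[codes]        # tokens for the prior, quantized latents for the decoder

z = torch.randn(100, 64)                 # stand-in for a compressed audio chunk
codes, z_q = quantize(z)
```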