We explore two approaches to creatively altering vocal timbre using Differentiable Digital Signal Processing (DDSP). The first approach is inspired by classic cross-synthesis techniques. A pretrained DDSP decoder predicts a filter for a noise source and a harmonic distribution, based on pitch and loudness information extracted from the vocal input. Before synthesis, the harmonic distribution is modified by interpolating between the predicted distribution and the harmonics of the input. We provide a real-time implementation of this approach in the form of a Neutone model. In the second approach, autoencoder models are trained on datasets consisting of both vocal and instrument training data. To apply the effect, the trained autoencoder attempts to reconstruct the vocal input. We find that there is a desirable "sweet spot" during training, where the model has learned to reconstruct the phonetic content of the input vocals, but is still affected by the timbre of the instrument mixed into the training data. After further training, that effect disappears. A perceptual evaluation compares the two approaches. We find that the autoencoder in the second approach is able to reconstruct intelligible lyrical content without any explicit phonetic information provided during training.
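The harmonic interpolation step in the first approach can be illustrated with a minimal sketch. It assumes, hypothetically, that both harmonic distributions are given as lists of non-negative per-harmonic amplitudes of equal length, that the blend is linear, and that a single mix parameter `alpha` controls it. The abstract does not specify any of these details, and the function names are illustrative only, not part of DDSP or Neutone.

```python
from typing import List, Sequence


def normalize(amplitudes: Sequence[float]) -> List[float]:
    """Scale non-negative harmonic amplitudes so that they sum to 1."""
    if any(a < 0.0 for a in amplitudes):
        raise ValueError("harmonic amplitudes must be non-negative")
    total = sum(amplitudes)
    if total <= 0.0:
        raise ValueError("harmonic amplitudes must have a positive sum")
    return [a / total for a in amplitudes]


def interpolate_harmonics(
    predicted: Sequence[float],
    measured: Sequence[float],
    alpha: float,
) -> List[float]:
    """Linearly blend two harmonic distributions (assumed interpolation form).

    predicted: distribution from the decoder (hypothetical input).
    measured:  harmonics estimated from the vocal input (hypothetical input).
    alpha:     0.0 keeps the predicted distribution, 1.0 keeps the measured one.
    """
    if len(predicted) != len(measured):
        raise ValueError("both distributions need the same number of harmonics")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    p = normalize(predicted)
    m = normalize(measured)
    blended = [(1.0 - alpha) * pi + alpha * mi for pi, mi in zip(p, m)]
    # Renormalize to guard against floating-point drift.
    return normalize(blended)
```

Calling `interpolate_harmonics(predicted, measured, 0.5)` under these assumptions yields a distribution halfway between the instrument-like prediction and the harmonics of the voice; in a real-time setting the blend would be applied per frame before additive synthesis.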