Perceptual sound matching (PSM) aims to find the input parameters of a synthesizer that best imitate an audio target. Deep learning for PSM optimizes a neural network to analyze and reconstruct prerecorded samples. In this context, our article addresses the problem of designing a suitable loss function when the training set is generated by a differentiable synthesizer. Our main contribution is the perceptual-neural-physical loss (PNP), which addresses a tradeoff between perceptual relevance and computational efficiency. The key idea behind PNP is to linearize the effect of synthesis parameters upon auditory features in the vicinity of each training sample. The linearization procedure is massively parallelizable, can be precomputed, and offers a 100-fold speedup during gradient descent compared to differentiable digital signal processing (DDSP). We demonstrate PNP on two datasets of nonstationary sounds: an AM/FM arpeggiator and a physical model of rectangular membranes. We show that PNP is able to accelerate DDSP with the joint time-frequency scattering transform (JTFS) as auditory feature while preserving its perceptual fidelity. Additionally, we evaluate the impact of other design choices in PSM: parameter rescaling, pretraining, auditory representation, and gradient clipping. We report state-of-the-art results on both datasets and find that PNP-accelerated JTFS has a greater influence on PSM performance than any other design choice.
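One way to read the linearization step is as a first-order Taylor expansion of the feature-domain loss around each training sample. The sketch below uses notation assumed here, not taken from the text above: g is the differentiable synthesizer, Φ the auditory feature map (for instance JTFS), θ the ground-truth parameters of a training sample, θ̃ the network's estimate, and J and M are illustrative names for the Jacobian and the resulting quadratic form.

```latex
% Feature-domain (DDSP-style) loss for one training sample (assumed form):
\mathcal{L}^{\mathrm{DDSP}}_{\theta}(\tilde{\theta})
  = \tfrac{1}{2}\,\bigl\| (\Phi \circ g)(\tilde{\theta}) - (\Phi \circ g)(\theta) \bigr\|_2^2

% First-order expansion of the features around \theta,
% with J(\theta) the Jacobian of \Phi \circ g evaluated at \theta:
(\Phi \circ g)(\tilde{\theta})
  \approx (\Phi \circ g)(\theta) + J(\theta)\,(\tilde{\theta} - \theta)

% Substituting yields a quadratic form in parameter space:
\mathcal{L}^{\mathrm{DDSP}}_{\theta}(\tilde{\theta})
  \approx \tfrac{1}{2}\,(\tilde{\theta} - \theta)^{\top} M(\theta)\,(\tilde{\theta} - \theta),
\qquad M(\theta) = J(\theta)^{\top} J(\theta)
```

Under these assumptions, M(θ) depends only on the training sample, so it can be computed once per sample, independently across samples, before training begins. Each gradient step then evaluates a small quadratic form in parameter space and no longer has to run the synthesizer or the auditory feature extractor, which is consistent with the precomputation and speedup described above.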