Sound matching algorithms seek to approximate a target waveform by parametric audio synthesis. Deep neural networks have achieved promising results in matching sustained harmonic tones. However, the task is more challenging when targets are nonstationary and inharmonic, e.g., percussion. We attribute this problem to the inadequacy of the loss function. On one hand, mean square error in the parametric domain, known as "P-loss", is simple and fast but fails to accommodate the differing perceptual significance of each parameter. On the other hand, mean square error in the spectrotemporal domain, known as "spectral loss", is perceptually motivated and is used in differentiable digital signal processing (DDSP). Yet spectral loss is a poor predictor of pitch intervals, and its gradient may be computationally expensive, which leads to slow convergence. To address this conundrum, we present Perceptual-Neural-Physical loss (PNP). PNP is the optimal quadratic approximation of spectral loss while being as fast as P-loss during training. We instantiate PNP with physical modeling synthesis as the decoder and the joint time-frequency scattering transform (JTFS) as the spectral representation. We demonstrate its potential on matching synthetic drum sounds in comparison with other loss functions.
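To make the contrast between the three losses concrete, the following is a minimal sketch, not the authors' implementation. It assumes one natural reading of "quadratic approximation": linearize the composition of spectral representation and synthesizer around the target parameters, which yields a quadratic form with matrix M = JᵀJ, where J is the Jacobian of that composition. The toy synthesizer `synth` (a decaying sinusoid), the stand-in representation `features` (a magnitude spectrum rather than JTFS), and the finite-difference Jacobian are all hypothetical choices made for illustration.

```python
import numpy as np

def synth(theta, n=256, sr=256.0):
    """Hypothetical toy synthesizer: a decaying sinusoid.
    theta = (frequency in Hz, decay rate in 1/s)."""
    freq, decay = theta
    t = np.arange(n) / sr
    return np.exp(-decay * t) * np.sin(2 * np.pi * freq * t)

def features(x):
    """Stand-in spectral representation: magnitude spectrum.
    (The abstract uses JTFS; this is a simplification.)"""
    return np.abs(np.fft.rfft(x))

def p_loss(theta_hat, theta):
    """Squared error in the parameter domain; treats every parameter alike."""
    d = np.asarray(theta_hat, dtype=float) - np.asarray(theta, dtype=float)
    return float(d @ d)

def spectral_loss(theta_hat, theta):
    """Squared error between features of the synthesized sounds."""
    d = features(synth(theta_hat)) - features(synth(theta))
    return float(d @ d)

def jacobian(theta, eps=1e-4):
    """Central finite-difference Jacobian of features(synth(.)) at theta."""
    theta = np.asarray(theta, dtype=float)
    cols = []
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        diff = features(synth(theta + step)) - features(synth(theta - step))
        cols.append(diff / (2 * eps))
    return np.stack(cols, axis=1)  # shape: (n_features, n_params)

def pnp_matrix(theta):
    """Quadratic-form matrix M = J^T J; depends only on the target,
    so it can be precomputed before training (an assumption of this sketch)."""
    J = jacobian(theta)
    return J.T @ J

def pnp_loss(theta_hat, theta, M):
    """Quadratic form in the parameter error, weighted by M."""
    d = np.asarray(theta_hat, dtype=float) - np.asarray(theta, dtype=float)
    return float(d @ M @ d)

# Example: compare the three losses for a small parameter error.
theta = np.array([40.0, 3.0])
M = pnp_matrix(theta)
theta_hat = theta + np.array([1e-3, 1e-3])
print(p_loss(theta_hat, theta),
      spectral_loss(theta_hat, theta),
      pnp_loss(theta_hat, theta, M))
```

In this sketch, evaluating `pnp_loss` during training needs neither the synthesizer nor the spectral representation, only a small matrix product, which is why its cost is comparable to that of P-loss. Unlike P-loss, the diagonal of `M` weights each parameter by how strongly it changes the spectral representation.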