In neural audio signal processing, conditioning synthesizers on pitch has been used to improve their performance. However, jointly training a pitch estimator and a synthesizer is difficult with standard audio-to-audio reconstruction losses, so such systems typically rely on external pitch trackers. To address this issue, we propose a spectral loss function, inspired by optimal transport theory, that minimizes the displacement of spectral energy. We validate this approach on an unsupervised autoencoding task that fits a harmonic template to harmonic signals. A lightweight encoder jointly estimates the fundamental frequency and the harmonic amplitudes, and a differentiable harmonic synthesizer reconstructs the signal from them. The proposed approach offers a promising direction for unsupervised parameter estimation in neural audio applications.
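To illustrate why a transport-based spectral loss suits pitch estimation, here is a minimal NumPy sketch under stated assumptions; it is not the authors' formulation. It assumes a simple additive harmonic synthesizer (`harmonic_synth`, hypothetical), Hann-windowed magnitude spectra normalized to unit mass, and the 1-D Wasserstein-1 distance computed as the area between cumulative spectra (`ot_spectral_loss`, hypothetical). The sample rate, frame length, and 1/k amplitude rolloff are illustrative choices. A training setup would implement the same computation in an autodiff framework so gradients reach the encoder's f0 and amplitude outputs.

```python
import numpy as np

SR = 16000          # sample rate in Hz (assumed for illustration)
N = 4096            # frame length in samples (assumed)
BIN_HZ = SR / N     # width of one rfft bin in Hz


def harmonic_synth(f0, amps, sr=SR, n=N):
    """Hypothetical additive synth: sum of sines at k*f0 with amplitudes amps[k-1]."""
    t = np.arange(n) / sr
    k = np.arange(1, len(amps) + 1)
    phases = 2 * np.pi * f0 * k[:, None] * t[None, :]
    return (np.asarray(amps)[:, None] * np.sin(phases)).sum(axis=0)


def magnitude_spectrum(x):
    """Hann-windowed magnitude spectrum over the non-negative frequency bins."""
    return np.abs(np.fft.rfft(x * np.hanning(len(x))))


def ot_spectral_loss(p, q, bin_hz=BIN_HZ, eps=1e-12):
    """1-D Wasserstein-1 distance (in Hz) between spectra treated as mass distributions.

    On the real line, W1 equals the area between the two cumulative distributions,
    so it grows with how far spectral energy must move, not only with where bins differ.
    """
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * bin_hz


def l2_spectral_loss(p, q):
    """Bin-wise squared error on magnitude spectra, for comparison."""
    return np.mean((p - q) ** 2)


if __name__ == "__main__":
    amps = 1.0 / np.arange(1, 6)  # 5 harmonics with 1/k rolloff (assumed)
    target = magnitude_spectrum(harmonic_synth(220.0, amps))
    for f0 in (220.0, 225.0, 235.0, 250.0):
        est = magnitude_spectrum(harmonic_synth(f0, amps))
        print(f"f0={f0:6.1f}  OT={ot_spectral_loss(target, est):8.2f} Hz  "
              f"L2={l2_spectral_loss(target, est):10.4f}")
```

Running the script prints both losses as the candidate f0 moves away from the 220 Hz target. The transport loss grows roughly in proportion to the f0 error because every harmonic k must shift by about k times that error. A bin-wise L2 loss, by contrast, tends to level off once predicted and target peaks no longer overlap, which leaves little signal about which direction to move f0.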