Training the linear prediction (LP) operator end-to-end for audio synthesis in modern deep learning frameworks is slow because of its recursive formulation. In addition, frame-wise approximation, used as an acceleration method, does not generalise well to test-time conditions, where LP is computed sample-wise. An efficient, differentiable, sample-wise LP for end-to-end training is the key to removing this barrier. We generalise the efficient time-invariant LP implementation from the GOLF vocoder to the time-varying case. Combining this with the classic source-filter model, we show that the improved GOLF learns LP coefficients and reconstructs the voice better than its frame-wise counterparts. Moreover, in our listening test, synthesised outputs from GOLF received higher quality ratings than those of the state-of-the-art differentiable WORLD vocoder.
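To make the recursion concrete, the following is a minimal sketch of sample-wise time-varying LP synthesis, written as a plain NumPy loop. It is not the GOLF implementation and not the efficient method described above; it only illustrates why the operator is sequential, since each output sample depends on previous outputs. The function names `sample_wise_lp` and `hold_per_frame`, the zero initial state, and the sign convention y[n] = e[n] - Σ_k a_k[n] y[n-k] are assumptions made for illustration.

```python
import numpy as np


def sample_wise_lp(excitation, coeffs):
    """All-pole filtering with coefficients that may change at every sample.

    Assumed convention (hypothetical, for illustration only):
        y[n] = e[n] - sum_{k=1..M} coeffs[n, k-1] * y[n-k]
    with a zero initial state. `coeffs` has shape (num_samples, order).
    """
    excitation = np.asarray(excitation, dtype=float)
    coeffs = np.asarray(coeffs, dtype=float)
    num_samples, order = coeffs.shape
    out = np.zeros(num_samples)
    for n in range(num_samples):
        acc = excitation[n]
        # Each output sample needs the previous outputs, so the loop over n
        # cannot be parallelised naively.
        for k in range(1, min(order, n) + 1):
            acc -= coeffs[n, k - 1] * out[n - k]
        out[n] = acc
    return out


def hold_per_frame(frame_coeffs, hop):
    """Upsample frame-level coefficients to sample rate by holding each frame
    for `hop` samples (one simple choice; other interpolations are possible)."""
    return np.repeat(np.asarray(frame_coeffs, dtype=float), hop, axis=0)


# Usage: a one-pole filter with a constant coefficient of -0.5, driven by an impulse.
impulse = np.zeros(5)
impulse[0] = 1.0
constant = np.full((5, 1), -0.5)
response = sample_wise_lp(impulse, constant)
```

With a constant coefficient the loop reduces to an ordinary time-invariant all-pole filter, which gives a simple sanity check: the impulse response above decays geometrically by a factor of 0.5 per sample.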