Pitch estimation is an essential step in many speech processing algorithms, including speech coding, synthesis, and enhancement. Recently, pitch estimators based on deep neural networks (DNNs) have been outperforming well-established DSP-based techniques. Unfortunately, these new estimators can be impractical to deploy in real-time systems, both because of their relatively high complexity and because some require significant lookahead. We show that a hybrid estimator combining a small DNN with traditional DSP-based features can match or exceed the performance of pure DNN-based models, with complexity and algorithmic delay comparable to traditional DSP-based algorithms. We further demonstrate that this hybrid approach can provide benefits for a neural vocoding task.