Pitch estimation is an essential step in many speech processing algorithms, including speech coding, synthesis, and enhancement. Recently, pitch estimators based on deep neural networks (DNNs) have been outperforming well-established DSP-based techniques. Unfortunately, these new estimators can be impractical to deploy in real-time systems, both because of their relatively high complexity and because some require significant lookahead. We show that a hybrid estimator using a small DNN with traditional DSP-based features can match or exceed the performance of pure DNN-based models, with complexity and algorithmic delay comparable to traditional DSP-based algorithms. We further demonstrate that this hybrid approach can provide benefits for a neural vocoding task.