Speech enhancement models have progressed greatly in recent years, but the perceptual quality of their output speech remains limited. We propose an objective for perceptual quality based on temporal acoustic parameters. These parameters are fundamental speech features that play an essential role in various applications, including speaker recognition and paralinguistic analysis. We provide a differentiable estimator for four categories of low-level acoustic descriptors: frequency-related parameters, energy- or amplitude-related parameters, spectral balance parameters, and temporal features. Unlike prior work that considers only aggregated acoustic parameters or a few categories of acoustic parameters, our temporal acoustic parameter loss (TAPLoss) enables auxiliary optimization and improvement of many fine-grained speech characteristics in enhancement workflows. We show that adding TAPLoss as an auxiliary objective in speech enhancement produces speech with improved perceptual quality and intelligibility. Using data from the Deep Noise Suppression 2020 Challenge, we demonstrate that both time-domain and time-frequency-domain models benefit from our method.
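To make the idea of an auxiliary loss on frame-level acoustic parameters concrete, the following is a minimal sketch, not the estimator described in this work. It assumes two hand-crafted stand-in descriptors (per-frame log energy and spectral centroid) in place of the paper's differentiable estimator, a plain L1 waveform loss as a placeholder base objective, and a hypothetical weight of 0.1. All function names are illustrative. NumPy does not provide gradients, so in actual training the same computation would have to be written in an autodiff framework.

```python
import numpy as np

def frame_signal(x, frame_len=512, hop=256):
    """Split a 1-D signal into overlapping frames, shape (n_frames, frame_len)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def temporal_params(x, sr=16000, frame_len=512, hop=256, eps=1e-8):
    """Per-frame stand-in descriptors (assumption): log energy and spectral centroid in kHz."""
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    total = power.sum(axis=1)
    log_energy = np.log(total + eps)
    centroid_khz = (power * freqs).sum(axis=1) / (total + eps) / 1000.0
    return np.stack([log_energy, centroid_khz], axis=1)  # shape (n_frames, 2)

def tap_style_loss(enhanced, clean):
    """Mean absolute error between the frame-level parameter tracks of two signals."""
    return float(np.mean(np.abs(temporal_params(enhanced) - temporal_params(clean))))

def total_loss(enhanced, clean, weight=0.1):
    """Placeholder base loss (L1 on waveforms) plus the weighted auxiliary term."""
    base = float(np.mean(np.abs(enhanced - clean)))
    return base + weight * tap_style_loss(enhanced, clean)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(16000) / 16000.0
    clean = np.sin(2 * np.pi * 220.0 * t)
    noisy = clean + 0.1 * rng.standard_normal(t.shape)
    print("auxiliary term:", tap_style_loss(noisy, clean))
    print("combined loss:", total_loss(noisy, clean))
```

The auxiliary term compares parameter tracks frame by frame rather than utterance-level averages, which is the distinction the abstract draws with prior work on aggregated parameters.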