Previous studies have confirmed that augmenting acoustic features with place and manner of articulation features guides the speech enhancement (SE) process to take the broad phonetic properties of the input speech into account, which improves performance. In this paper, we explore the contextual information of articulatory attributes as additional information to further benefit SE. More specifically, we propose to improve SE performance by leveraging losses from an end-to-end automatic speech recognition (E2E-ASR) model that predicts the sequence of broad phonetic classes (BPCs). We also develop a multi-objective training scheme that combines ASR and perceptual losses to train the SE system with a BPC-based E2E-ASR model. Experimental results on speech denoising, speech dereverberation, and impaired speech enhancement tasks confirm that contextual BPC information improves SE performance. Moreover, the SE model trained with the BPC-based E2E-ASR model outperforms the one trained with a phoneme-based E2E-ASR model. These results suggest that objectives affected by the ASR system's phoneme misclassifications may provide imperfect feedback, and that BPCs could be a better choice of target. Finally, we note that merging the most confusable phonetic targets into the same BPC when calculating the additional objective effectively improves SE performance.
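As a rough illustration of the two ideas in the abstract, the sketch below maps phonemes to BPC targets and combines several losses into one training objective. Everything specific in it is an assumption made for illustration: the manner-based class inventory, the ARPAbet-style phoneme subset, the function names `to_bpc_sequence` and `multi_objective_loss`, the weighted-sum form of the objective, and the weights `alpha` and `beta`. None of these are taken from the paper.

```python
# Hypothetical manner-of-articulation grouping over a few ARPAbet-style phonemes.
# The class inventory and membership are illustrative assumptions,
# not the grouping used in the paper.
PHONEME_TO_BPC = {
    "iy": "vowel", "ae": "vowel", "aa": "vowel", "uw": "vowel",
    "p": "stop", "b": "stop", "t": "stop", "d": "stop", "k": "stop", "g": "stop",
    "s": "fricative", "z": "fricative", "f": "fricative", "v": "fricative",
    "m": "nasal", "n": "nasal", "ng": "nasal",
    "l": "semivowel", "r": "semivowel", "w": "semivowel", "y": "semivowel",
    "sil": "silence",
}


def to_bpc_sequence(phonemes):
    """Map a phoneme sequence to a BPC target sequence, one label per phoneme."""
    return [PHONEME_TO_BPC[p] for p in phonemes]


def multi_objective_loss(se_loss, asr_loss, perceptual_loss, alpha=0.1, beta=0.1):
    """Combine the enhancement, ASR, and perceptual losses as a weighted sum.

    The weighted-sum form and the default weights are assumptions.
    """
    return se_loss + alpha * asr_loss + beta * perceptual_loss


# Easily confused phonemes (b/p, t/d) collapse to the same BPC targets,
# so an ASR model that confuses them would still yield the same label sequence.
print(to_bpc_sequence(["b", "ae", "t"]))
print(to_bpc_sequence(["p", "ae", "d"]))
```

In this sketch, "b ae t" and "p ae d" map to the same BPC sequence, which is one way to picture why a coarser target could give the SE model more reliable feedback than phoneme-level labels.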