Existing deep-learning-based speech enhancement (SE) methods either use blind end-to-end training or explicitly incorporate speaker embeddings or phonetic information into the SE network to improve speech quality. In this paper, we treat speech and noise as different types of sound events and propose an event-based query method for SE. Specifically, representative speech embeddings that can discriminate speech from noise are first pre-trained on the sound event detection (SED) task. The embeddings are then clustered into fixed golden speech queries that assist the SE network in enhancing speech from noisy audio. The golden speech queries can be obtained offline and generalize to different SE datasets and networks. Therefore, little extra complexity is introduced and no per-speaker enrollment is needed. Experimental results show that the proposed method yields significant gains over the baselines and that the golden queries generalize well to different datasets.
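As an illustrative sketch only, the offline step of clustering speech embeddings into a fixed set of queries might look as follows. The abstract does not state which clustering algorithm, embedding dimension, or number of queries is used, so plain k-means, the 128-dimensional random stand-in embeddings, the choice of 8 queries, and the function name `cluster_speech_queries` are all assumptions made here for illustration.

```python
import numpy as np


def cluster_speech_queries(embeddings, num_queries, num_iters=50, seed=0):
    """Cluster (N, D) embeddings into (num_queries, D) centroids with plain k-means.

    Hypothetical helper: k-means is an assumed stand-in for the clustering step.
    """
    rng = np.random.default_rng(seed)
    # Initialise the centroids from randomly chosen, distinct embeddings.
    init_idx = rng.choice(len(embeddings), size=num_queries, replace=False)
    centroids = embeddings[init_idx].copy()

    for _ in range(num_iters):
        # Squared Euclidean distance from every embedding to every centroid.
        diffs = embeddings[:, None, :] - centroids[None, :, :]
        labels = (diffs ** 2).sum(axis=-1).argmin(axis=1)

        # Move each centroid to the mean of its members; keep empty clusters as-is.
        new_centroids = centroids.copy()
        for k in range(num_queries):
            members = embeddings[labels == k]
            if len(members) > 0:
                new_centroids[k] = members.mean(axis=0)

        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids


# Random vectors stand in for speech embeddings from an SED-pretrained encoder.
data_rng = np.random.default_rng(0)
speech_embeddings = data_rng.normal(size=(1000, 128)).astype(np.float32)

# Computed once offline, then kept fixed for any SE network or dataset.
golden_queries = cluster_speech_queries(speech_embeddings, num_queries=8)
```

Because the queries are computed once and then frozen, the only cost at enhancement time would be whatever mechanism the SE network uses to consume them, which is consistent with the claim that little extra complexity is introduced.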