Existing deep-learning-based speech enhancement (SE) methods either use blind end-to-end training or explicitly incorporate speaker embeddings or phonetic information into the SE network to improve speech quality. In this paper, we treat speech and noise as different types of sound events and propose an event-based query method for SE. Specifically, representative speech embeddings that can discriminate speech from noise are first pre-trained on the sound event detection (SED) task. The embeddings are then clustered into fixed golden speech queries that help the SE network enhance the speech in noisy audio. The golden speech queries can be obtained offline and generalize to different SE datasets and networks. Therefore, little extra complexity is introduced and no per-speaker enrollment is needed. Experimental results show that the proposed method yields significant gains over the baselines and that the golden queries generalize well to different datasets.
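The clustering and query steps described above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the use of k-means as the clustering method, scaled dot-product attention as the way frames consult the queries, the function names `kmeans_queries` and `attend_to_queries`, the embedding dimension, the number of queries, and the random arrays standing in for SED-pretrained speech embeddings and noisy-audio frame features.

```python
import numpy as np


def kmeans_queries(embeddings, num_queries, num_iters=50, seed=0):
    """Cluster (N, D) embeddings into (num_queries, D) centroids with Lloyd's k-means.

    Hypothetical helper: the abstract does not state which clustering method is used.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from distinct, randomly chosen embeddings.
    idx = rng.choice(len(embeddings), size=num_queries, replace=False)
    centroids = embeddings[idx].copy()
    for _ in range(num_iters):
        # Squared Euclidean distance from every embedding to every centroid: (N, K).
        dists = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for k in range(num_queries):
            members = embeddings[assign == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids


def attend_to_queries(frames, queries):
    """Scaled dot-product attention of (T, D) frames over (K, D) fixed queries.

    Hypothetical helper: one plausible way an SE network could consult the queries.
    """
    scores = frames @ queries.T / np.sqrt(queries.shape[1])  # (T, K)
    scores -= scores.max(axis=1, keepdims=True)              # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)            # softmax over queries
    return weights @ queries, weights                        # (T, D), (T, K)


# Synthetic stand-ins (assumed shapes, random data).
rng = np.random.default_rng(0)
speech_embeddings = rng.normal(size=(500, 16))   # stand-in for SED-pretrained embeddings
noisy_frames = rng.normal(size=(100, 16))        # stand-in for noisy-audio frame features

# Offline step: computed once, then reused across utterances and speakers.
golden_queries = kmeans_queries(speech_embeddings, num_queries=4)

# Online step: each frame gathers a speech-related context vector from the fixed queries.
context, weights = attend_to_queries(noisy_frames, golden_queries)
```

In this sketch the queries are computed once and stored, which mirrors the abstract's point that they are obtained offline and need no per-speaker enrollment; how the resulting context would be fused into an actual SE network is left open.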