Self-supervised representation learning (SSL) has attained state-of-the-art (SOTA) results on several downstream speech tasks, but SSL-based speech enhancement (SE) solutions still lag behind. To address this issue, we exploit three main ideas: (i) Transformer-based mask generation, (ii) a consistency-preserving loss, and (iii) perceptual contrast stretching (PCS). In detail, Conformer layers, leveraging an attention mechanism, are introduced to effectively model frame-level representations and estimate the Ideal Ratio Mask (IRM) for SE. Moreover, we incorporate a consistency term in the loss function to account for the inconsistency introduced when reconstructing the signal from a modified spectrogram. Finally, PCS is employed to increase the contrast of input and target features according to perceptual importance. Evaluated on the VoiceBank-DEMAND task, the proposed solution outperforms previous SSL-based SE solutions on several objective metrics, attaining a SOTA PESQ score of 3.54.
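Two of the ingredients above can be made concrete with a short sketch. The IRM assigns each time-frequency bin the ratio of clean-speech magnitude to total magnitude, and a consistency-preserving loss re-analyses the enhanced spectrogram through an iSTFT/STFT round trip before comparing it to the target, since a masked spectrogram is generally not one that any real waveform produces. This is a minimal illustration using `scipy.signal`, not the paper's implementation; the function names and the squared-magnitude criterion are assumptions for exposition.

```python
import numpy as np
from scipy.signal import stft, istft

def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-8):
    """IRM: fraction of clean-speech magnitude per time-frequency bin.

    Values lie in [0, 1]; multiplying the noisy spectrogram by this
    mask suppresses noise-dominated bins.
    """
    return clean_mag / (clean_mag + noise_mag + eps)

def consistency_loss(est_spec, ref_spec, fs=16000, nperseg=512):
    """Loss computed on a 'consistent' version of the estimate.

    The enhanced spectrogram is first converted to a waveform (iSTFT)
    and re-analysed (STFT), so the loss sees the spectrogram that
    signal reconstruction would actually yield.
    """
    _, wav = istft(est_spec, fs=fs, nperseg=nperseg)
    _, _, consistent_spec = stft(wav, fs=fs, nperseg=nperseg)
    # Trim both to a common number of frames before comparing.
    T = min(consistent_spec.shape[-1], ref_spec.shape[-1])
    diff = np.abs(consistent_spec[..., :T]) - np.abs(ref_spec[..., :T])
    return float(np.mean(diff ** 2))
```

In practice the mask would be predicted by the Conformer-based network rather than computed from oracle clean/noise spectra; the oracle IRM above is the training target the network regresses toward.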