Real-time speech enhancement (SE) is essential to online speech communication. Causal SE models use only past context, yet predicting future information, such as phoneme continuation, may help causal SE. Phonetic information is often represented by quantizing the latent features of self-supervised learning (SSL) models. This work is the first to incorporate causal SSL features into an SE model. The causal SSL features are encoded and combined with spectrogram features using feature-wise linear modulation (FiLM) to estimate a mask that enhances the noisy input speech. Simultaneously, we quantize the causal SSL features using vector quantization to represent phonetic characteristics as semantic tokens. The model not only encodes SSL features but also predicts future semantic tokens through multi-task learning (MTL). Experimental results on the VoiceBank + DEMAND dataset show that our proposed method achieves a PESQ score of 2.88, in particular when trained with semantic prediction MTL, confirming that semantic prediction plays an important role in causal SE.
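The two operations named in the abstract, feature-wise linear modulation and vector quantization, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `film` and `quantize`, the plain linear projections, all dimensions, the random codebook, and the look-ahead `shift` are assumptions chosen only for illustration.

```python
import numpy as np

def film(spec_feat, ssl_feat, w_gamma, w_beta):
    """Feature-wise linear modulation (hypothetical helper).

    The conditioning features (here: SSL features) produce a per-channel
    scale (gamma) and shift (beta) that are applied to the spectrogram features.
    """
    gamma = ssl_feat @ w_gamma   # (T, C) scale
    beta = ssl_feat @ w_beta     # (T, C) shift
    return gamma * spec_feat + beta

def quantize(ssl_feat, codebook):
    """Vector quantization (hypothetical helper).

    Maps each frame to the index of its nearest codebook vector
    under squared Euclidean distance.
    """
    diff = ssl_feat[:, None, :] - codebook[None, :, :]   # (T, K, D)
    dist = (diff ** 2).sum(axis=-1)                      # (T, K)
    return dist.argmin(axis=1)                           # (T,)

# Assumed toy sizes: T frames, D-dim SSL features, C spectrogram channels, K codes.
rng = np.random.default_rng(0)
T, D, C, K = 6, 8, 4, 5
ssl_feat = rng.standard_normal((T, D))
spec_feat = rng.standard_normal((T, C))
w_gamma = rng.standard_normal((D, C))
w_beta = rng.standard_normal((D, C))
codebook = rng.standard_normal((K, D))

modulated = film(spec_feat, ssl_feat, w_gamma, w_beta)   # input to a mask estimator
tokens = quantize(ssl_feat, codebook)                    # semantic tokens per frame

# Future-token targets for the auxiliary task: frame t is paired with the
# token at frame t + shift (the look-ahead of 2 frames is an assumption).
shift = 2
inputs_for_prediction = modulated[:-shift]
future_targets = tokens[shift:]
```

In a multi-task setup of the kind the abstract describes, `modulated` would feed the mask estimator, while `future_targets` would supervise an auxiliary prediction head that sees only frames up to `t`, which keeps the model causal.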