In real-time speech communication systems, speech signals are often degraded by multiple distortions. Recently, a two-stage Repair-and-Denoising network (RaD-Net) was proposed and achieved superior speech quality improvement in the ICASSP 2024 Speech Signal Improvement (SSI) Challenge. However, its inability to use future information and the constrained receptive field of its convolution layers limit the system's performance. To mitigate these problems, we extend RaD-Net to an upgraded version, RaD-Net 2. Specifically, causality-based knowledge distillation is introduced in the first stage to exploit future information in a causal way: the non-causal repairing network serves as the teacher to improve the performance of the causal repairing network. In addition, in the second stage, complex axial self-attention is applied in the complex feature encoder/decoder of the denoising network. Experimental results on the ICASSP 2024 SSI Challenge blind test set show that RaD-Net 2 improves OVRL DNSMOS by 0.10 over RaD-Net.
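The teacher-student idea in the first stage can be illustrated with a minimal sketch. The abstract does not specify the distillation objective, so everything here is an assumption made for illustration: the function names `mse` and `distillation_loss`, the use of mean squared error for both terms, and the weighting factor `alpha` are hypothetical and are not taken from RaD-Net 2. The sketch only shows the general pattern, in which a causal student is trained to match both the clean target and the output of a non-causal teacher that had access to future frames.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    student_out: Sequence[float],
    teacher_out: Sequence[float],
    target: Sequence[float],
    alpha: float = 0.5,
) -> float:
    """Combine a task loss with a teacher-matching loss.

    student_out: output of the causal network (past and current frames only).
    teacher_out: output of the non-causal network (also sees future frames),
                 treated as a fixed reference.
    target:      clean reference signal.
    alpha:       hypothetical weight on the distillation term (assumption).
    """
    task_term = mse(student_out, target)          # fit the clean target
    distill_term = mse(student_out, teacher_out)  # imitate the teacher
    return (1.0 - alpha) * task_term + alpha * distill_term


if __name__ == "__main__":
    student = [1.0, 2.0]
    teacher = [1.0, 4.0]
    clean = [0.0, 2.0]
    print(distillation_loss(student, teacher, clean, alpha=0.5))
```

In a real training setup the teacher would typically be pre-trained and frozen, so that only the causal student receives gradient updates; at inference time only the student runs, which keeps the system causal.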