This paper considers the joint compression and enhancement problem for speech signals in the presence of noise. Recently, the SoundStream codec, which relies on end-to-end joint training of an encoder-decoder pair and a residual vector quantizer with a combination of adversarial and reconstruction losses, has shown very promising performance, especially in subjective perceptual quality. In this work, we provide a theoretical result showing that, to simultaneously achieve low distortion and high perceptual quality in the presence of noise, there exists an optimal two-stage optimization procedure for the joint compression and enhancement problem. This procedure first optimizes an encoder-decoder pair using only a distortion loss, and then fixes the encoder and optimizes a perceptual decoder using a perception loss. Based on this result, we construct a two-stage training framework for joint compression and enhancement of noisy speech signals. Unlike existing training methods, which are heuristic, the proposed two-stage training method has a theoretical foundation. Finally, experimental results are provided for various noise and bit-rate conditions. The results demonstrate that a codec trained with the proposed framework can outperform SoundStream and other representative codecs in terms of both objective and subjective evaluation metrics. Code is available at \textit{https://github.com/jscscloris/SEStream}.
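The two-stage procedure described in the abstract can be illustrated with a small toy sketch. It is not the SEStream implementation and makes several loud assumptions: the "speech" is a scalar Gaussian source with additive Gaussian noise, the encoder is a scalar quantizer updated by a Lloyd-style alternation rather than gradient training, the decoders are lookup tables, and an empirical 1-D Wasserstein-2 distance stands in for the perception loss. All names (`make_data`, `stage1`, `stage2`, `w2_squared`) are hypothetical.

```python
import random

def make_data(n=5000, noise_std=0.5, seed=0):
    """Toy stand-in for speech: clean x ~ N(0, 1), noisy y = x + noise."""
    rng = random.Random(seed)
    clean = [rng.gauss(0.0, 1.0) for _ in range(n)]
    noisy = [x + rng.gauss(0.0, noise_std) for x in clean]
    return clean, noisy

def mean(values):
    return sum(values) / len(values)

def encode(y, thresholds):
    """Scalar quantizer: the code is the number of thresholds below y."""
    return sum(y > t for t in thresholds)

def fit_mse_decoder(clean, codes, levels, fallback):
    """Distortion-optimal table: per-code mean of the CLEAN signal."""
    table = []
    for k in range(levels):
        bucket = [x for x, c in zip(clean, codes) if c == k]
        table.append(mean(bucket) if bucket else fallback[k])
    return table

def stage1(clean, noisy, levels=4, iters=10):
    """Stage 1 (assumed form): alternate encoder and decoder updates, MSE only."""
    mx, my = mean(clean), mean(noisy)
    # Linear MMSE gain, so that mx + gain * (y - my) approximates E[x | y].
    gain = (sum((y - my) * (x - mx) for x, y in zip(clean, noisy))
            / sum((y - my) ** 2 for y in noisy))
    table = [k - (levels - 1) / 2 for k in range(levels)]  # initial guess
    for _ in range(iters):
        # Encoder update: boundaries where the estimate of x is midway
        # between two neighbouring reconstruction values.
        thresholds = [my + ((table[k] + table[k + 1]) / 2 - mx) / gain
                      for k in range(levels - 1)]
        codes = [encode(y, thresholds) for y in noisy]
        # Decoder update: conditional means of the clean signal.
        table = fit_mse_decoder(clean, codes, levels, table)
    return thresholds, table

def stage2(clean, codes, levels, mse_table):
    """Stage 2 (assumed form): encoder frozen, fit a second decoder so the
    output distribution is close to the clean one (Wasserstein-2 proxy)."""
    ordered = sorted(clean)
    table, start = [], 0
    for k in range(levels):
        count = codes.count(k)
        chunk = ordered[start:start + count]  # quantile slice with mass p_k
        table.append(mean(chunk) if chunk else mse_table[k])
        start += count
    return table

def mse(clean, outputs):
    return mean([(x - o) ** 2 for x, o in zip(clean, outputs)])

def w2_squared(clean, outputs):
    """Empirical squared 1-D Wasserstein-2 distance via sorted pairing."""
    return mean([(x - o) ** 2
                 for x, o in zip(sorted(clean), sorted(outputs))])

clean, noisy = make_data()
thresholds, mse_table = stage1(clean, noisy)
codes = [encode(y, thresholds) for y in noisy]  # encoder is frozen from here
perc_table = stage2(clean, codes, len(mse_table), mse_table)

out_mse = [mse_table[c] for c in codes]
out_perc = [perc_table[c] for c in codes]
print("stage-1 decoder: MSE", mse(clean, out_mse),
      "W2^2", w2_squared(clean, out_mse))
print("stage-2 decoder: MSE", mse(clean, out_perc),
      "W2^2", w2_squared(clean, out_perc))
```

In this toy setting the stage-1 decoder should report the lower MSE, while the stage-2 decoder, which shares the same frozen encoder, should report the lower distribution distance. That mirrors the distortion-perception split the abstract describes, under the simplifying assumptions above.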