Speech separation (SS) has advanced significantly with neural network-based methods, which show improved performance on signal-level metrics. However, these methods often struggle to maintain speech intelligibility in the separated signals, which can degrade the performance of downstream tasks such as speech recognition. In this work, we propose SLM-SS, a novel approach that applies speech language models to SS, aiming to enhance the intelligibility and coherence of the separated signals. We frame SS as discrete multi-codebook sequence generation, using encoder-decoder models to map quantized speech mixtures to target tokens. In addition to the autoregressive modeling strategy, we introduce a non-autoregressive model that improves decoding efficiency for the residual tokens. Experimental results on the LibriMix dataset demonstrate that our approach preserves speech intelligibility significantly better than existing methods, leading to improved linguistic consistency across a variety of downstream tasks.
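The abstract describes a two-stage decoding scheme over multi-codebook speech tokens: an autoregressive encoder-decoder model generates first-codebook tokens for a target speaker from the quantized mixture, and a non-autoregressive model fills in the remaining residual codebooks in parallel. The following is a minimal PyTorch sketch of one plausible reading of that pipeline, in the style of VALL-E-like AR + NAR token modeling; the class names (`ARDecoder`, `NARDecoder`), the codec parameters (`VOCAB`, `N_CODEBOOKS`), and all layer sizes are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of AR first-codebook decoding plus NAR residual-codebook
# decoding; architecture details are assumptions, not the paper's configuration.
import torch
import torch.nn as nn

VOCAB = 1024      # assumed codec codebook size
N_CODEBOOKS = 8   # assumed number of residual-quantizer codebooks
D_MODEL = 256

class ARDecoder(nn.Module):
    """Autoregressively generates the target speaker's first-codebook tokens,
    conditioned on the mixture's tokens via an encoder-decoder Transformer."""
    def __init__(self):
        super().__init__()
        self.mix_emb = nn.Embedding(VOCAB, D_MODEL)
        self.tgt_emb = nn.Embedding(VOCAB, D_MODEL)
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, mix_tokens, tgt_tokens):
        # Causal mask: each position attends only to earlier target tokens.
        T = tgt_tokens.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.transformer(self.mix_emb(mix_tokens),
                             self.tgt_emb(tgt_tokens),
                             tgt_mask=mask)
        return self.head(h)  # (B, T, VOCAB) next-token logits

class NARDecoder(nn.Module):
    """Predicts residual codebooks in parallel: the input is the sum of
    embeddings of all codebook levels decoded so far, with no causal mask."""
    def __init__(self):
        super().__init__()
        self.embs = nn.ModuleList(
            nn.Embedding(VOCAB, D_MODEL) for _ in range(N_CODEBOOKS))
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.heads = nn.ModuleList(
            nn.Linear(D_MODEL, VOCAB) for _ in range(N_CODEBOOKS - 1))

    def forward(self, decoded_levels):
        # decoded_levels: list of (B, T) token tensors for codebooks 1..k.
        k = len(decoded_levels)
        h = sum(self.embs[i](t) for i, t in enumerate(decoded_levels))
        h = self.encoder(h)          # fully parallel over the time axis
        return self.heads[k - 1](h)  # logits for codebook k+1

# Toy usage: AR stage yields codebook-1 tokens; NAR stage fills in the rest.
B, T = 2, 50
mix = torch.randint(0, VOCAB, (B, T))
first = torch.randint(0, VOCAB, (B, T))   # stand-in for sampled AR output
ar_logits = ARDecoder()(mix, first)
nar = NARDecoder()
levels = [first]
for _ in range(N_CODEBOOKS - 1):          # one parallel pass per residual level
    levels.append(nar(levels).argmax(dim=-1))
tokens = torch.stack(levels, dim=1)       # (B, N_CODEBOOKS, T), ready for a codec decoder
print(ar_logits.shape, tokens.shape)
```

The efficiency claim in the abstract falls out of this structure: the AR stage costs one decoding step per frame, but each residual codebook is recovered in a single parallel pass, so the residual stages add only K-1 forward passes rather than K-1 full autoregressive loops.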