Recently, unified speech-text models such as SpeechGPT, VioLA, and AudioPaLM have achieved remarkable performance on various speech tasks. These models discretize speech signals into tokens (speech discretization), use a shared vocabulary for both text and speech tokens, and train a single decoder-only Transformer on a mixture of speech tasks. However, for the ASR task they rely on the Loss Masking strategy, which computes no loss on the input speech tokens and therefore ignores the dependency among them. In this paper, we propose to model speech tokens autoregressively, in the same way as text. We find that applying the conventional cross-entropy loss to the input speech tokens does not consistently improve ASR performance over Loss Masking. To address this issue, we propose a novel approach called Smoothed Label Distillation (SLD), which applies a KL divergence loss with smoothed labels to the speech tokens. Our experiments show that SLD effectively models speech tokens and outperforms Loss Masking for decoder-only Transformers on ASR tasks with different speech discretization methods. The source code is available at https://github.com/alibaba-damo-academy/SpokenNLP/tree/main/sld
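The three training objectives contrasted in the abstract (Loss Masking, plain cross-entropy on speech tokens, and a KL divergence loss against smoothed labels) can be sketched in a few lines. The sketch below is illustrative only and is not taken from the linked repository: the function names, the smoothing value `eps=0.1`, the choice to spread the smoothing mass evenly over the non-target tokens, and the equal weighting of speech and text positions are all assumptions made for the example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def smoothed_label(target, vocab_size, eps):
    """Assumed smoothing: 1 - eps on the target, eps shared by the other tokens."""
    off_value = eps / (vocab_size - 1)
    q = [off_value] * vocab_size
    q[target] = 1.0 - eps
    return q


def kl_divergence(q, log_p):
    """KL(q || p) = sum_i q_i * (log q_i - log p_i); terms with q_i == 0 vanish."""
    return sum(qi * (math.log(qi) - lp) for qi, lp in zip(q, log_p) if qi > 0.0)


def cross_entropy(target, log_p):
    """Negative log-likelihood of the target token."""
    return -log_p[target]


def sequence_loss(logits, targets, is_speech, mode="sld", eps=0.1):
    """Average next-token loss over one [speech ; text] sequence.

    logits[t]    scores over the shared vocabulary for predicting targets[t]
    is_speech[t] True if targets[t] is a speech token, False if it is text
    mode         "mask": no loss on speech positions (Loss Masking)
                 "ce":   cross-entropy on speech positions
                 "sld":  KL divergence to a smoothed label on speech positions
    Text positions always use cross-entropy (an assumption of this sketch).
    """
    if mode not in ("mask", "ce", "sld"):
        raise ValueError(f"unknown mode: {mode}")
    total, count = 0.0, 0
    for step_logits, target, speech in zip(logits, targets, is_speech):
        if speech and mode == "mask":
            continue  # masked positions contribute nothing to the loss
        log_p = log_softmax(step_logits)
        if speech and mode == "sld":
            q = smoothed_label(target, len(step_logits), eps)
            total += kl_divergence(q, log_p)
        else:
            total += cross_entropy(target, log_p)
        count += 1
    if count == 0:
        raise ValueError("no positions contribute to the loss")
    return total / count
```

With `eps=0.0` the smoothed label collapses to a one-hot vector and the KL term reduces to ordinary cross-entropy, so in this sketch the `"sld"` and `"ce"` modes differ only through the smoothing.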