Recently, unified speech-text models such as SpeechGPT, VioLA, and AudioPaLM have achieved remarkable performance on speech tasks. These models convert continuous speech signals into discrete tokens (speech discretization) and merge text and speech tokens into a shared vocabulary. They then train a single decoder-only Transformer on a mixture of speech tasks. For the ASR task, all of these models apply Loss Masking to the input speech tokens, that is, they exclude those tokens from the training loss, so the dependency between speech tokens is not explicitly modeled. In this paper, we attempt to model the sequence of speech tokens autoregressively, as is done for text. However, we find that applying the conventional cross-entropy loss to the input speech tokens does not consistently improve ASR performance over Loss Masking. We therefore propose Smoothed Label Distillation (SLD), a novel approach that applies a KL divergence loss with smoothed labels to the input speech tokens in order to model them effectively. Experiments demonstrate that SLD alleviates the limitations of the cross-entropy loss and consistently outperforms Loss Masking for decoder-only Transformer-based ASR across different speech discretization methods.
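
The three training objectives contrasted above (Loss Masking, cross-entropy on speech tokens, and a KL divergence loss with smoothed labels) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not specify how labels are smoothed or how losses are normalized, so the uniform smoothing scheme, the `eps` value, the averaging over counted positions, and all function names below are assumptions made for illustration.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one position's logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def smoothed_label(target, vocab_size, eps):
    # Assumption: uniform smoothing. The target keeps 1 - eps and the
    # remaining eps is shared equally by all other tokens.
    off = eps / (vocab_size - 1)
    return [1.0 - eps if i == target else off for i in range(vocab_size)]

def cross_entropy(logits, target):
    # Standard next-token cross-entropy with a one-hot label.
    return -log_softmax(logits)[target]

def kl_to_smoothed(logits, target, eps):
    # KL(q || p), where q is the smoothed label and p the model distribution.
    logp = log_softmax(logits)
    q = smoothed_label(target, len(logits), eps)
    return sum(qi * (math.log(qi) - lp) for qi, lp in zip(q, logp) if qi > 0.0)

def sequence_loss(logits_seq, targets, is_speech, mode, eps=0.1):
    # mode = "mask": speech positions are excluded from the loss.
    # mode = "ce":   speech positions use cross-entropy.
    # mode = "sld":  speech positions use KL against smoothed labels.
    # Text positions always use cross-entropy.
    # Assumption: the loss is averaged over the positions that contribute.
    total, count = 0.0, 0
    for logits, tgt, speech in zip(logits_seq, targets, is_speech):
        if speech:
            if mode == "mask":
                continue
            if mode == "ce":
                total += cross_entropy(logits, tgt)
            elif mode == "sld":
                total += kl_to_smoothed(logits, tgt, eps)
            else:
                raise ValueError(f"unknown mode: {mode}")
        else:
            total += cross_entropy(logits, tgt)
        count += 1
    return total / count

# Toy sequence: two speech-token positions followed by one text-token position,
# over a hypothetical shared vocabulary of four tokens.
logits_seq = [[2.0, 0.5, 0.1, -1.0], [0.0, 1.5, 0.2, 0.3], [0.1, 0.2, 3.0, 0.0]]
targets = [0, 1, 2]
is_speech = [True, True, False]

for mode in ("mask", "ce", "sld"):
    print(mode, sequence_loss(logits_seq, targets, is_speech, mode))
```

In this sketch, setting `eps` to 0 makes the KL term coincide with the cross-entropy term, which shows that the two objectives differ only in the target distribution used at speech positions.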