Current speech-based LLMs are predominantly trained on extensive ASR and TTS datasets, and they excel at tasks in those domains. However, their ability to handle direct speech-to-speech conversation remains notably constrained. These models typically rely on an ASR-to-TTS chain-of-thought pipeline, converting speech into text for processing before generating an audio response, which introduces latency and discards acoustic information present in the original speech. We propose a method that implicitly internalizes the ASR chain of thought into a speech LLM, strengthening its native speech understanding. Our approach reduces latency and improves comprehension of spoken input, paving the way for more efficient and natural real-time audio interaction. We also release a large-scale synthetic conversational dataset to facilitate further research.
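To make the architectural contrast concrete, the sketch below illustrates the cascaded ASR-to-TTS chain-of-thought pipeline versus a direct speech-to-speech model. This is a minimal illustration, not the paper's implementation: all function names (`transcribe`, `generate_text`, `synthesize`, `direct_speech_lm`) are hypothetical placeholders standing in for real components.

```python
# Illustrative sketch only: placeholder stubs stand in for real ASR,
# text-LLM, and TTS components. None of these names are a real API.

def transcribe(audio: bytes) -> str:
    """Placeholder ASR stage: speech in, text out."""
    return "user question"

def generate_text(prompt: str) -> str:
    """Placeholder text LLM: text in, text out."""
    return f"answer to: {prompt}"

def synthesize(text: str) -> bytes:
    """Placeholder TTS stage: text in, speech out."""
    return text.encode()

def cascaded_pipeline(audio: bytes) -> bytes:
    """ASR-to-TTS chain of thought: speech -> text -> text -> speech.
    Each stage adds latency, and the intermediate text discards
    prosody, emotion, and speaker cues carried only in the audio."""
    text_in = transcribe(audio)        # stage 1: ASR
    text_out = generate_text(text_in)  # stage 2: text LLM
    return synthesize(text_out)        # stage 3: TTS

def direct_speech_lm(audio: bytes) -> bytes:
    """A single speech LLM that maps input speech directly to output
    speech, with the ASR step internalized rather than run explicitly
    (the approach this work pursues)."""
    return b"speech response"
```

The key difference is that `cascaded_pipeline` must wait for full intermediate transcripts at every stage boundary, whereas a single model operating on speech directly avoids those boundaries and retains the acoustic signal end to end.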