Current speech language models generate responses directly without explicit reasoning, so errors cannot be corrected once audio is produced. We introduce \textbf{``Silent Thought, Spoken Answer''} -- a paradigm in which speech LLMs generate internal text reasoning alongside spoken responses, with the thinking traces conditioning the quality of the generated speech. To realize this, we present \method{}, the first diffusion-based speech-text language model supporting both understanding and generation, unifying discrete text and tokenized speech under a single masked diffusion framework. Unlike autoregressive approaches, \method{} jointly generates reasoning traces and speech tokens through iterative denoising, using modality-specific masking schedules. We also construct \dataset{}, the first speech QA dataset with paired text reasoning traces, containing 26K samples totaling 319 hours. Experiments show that \method{} achieves state-of-the-art speech-to-speech QA accuracy, outperforming the best baseline by up to 9 points, while attaining the best TTS quality among generative models (6.2\% WER) and preserving language understanding (66.2\% MMLU). Ablations confirm that both the diffusion architecture and the thinking traces contribute to these gains.