Recent advances in large language models (LLMs) have attracted significant interest in extending their capabilities to multimodal scenarios, particularly for speech-to-speech conversational systems. However, existing multimodal models that handle interleaved audio and text rely on autoregressive (AR) methods, overlooking that text depends mainly on target-target relations whereas audio depends mainly on source-target relations. In this work, we propose Text-to-Talk (TtT), a unified audio-text framework that integrates AR text generation with non-autoregressive (NAR) audio diffusion in a single Transformer. By leveraging the any-order AR property of absorbing discrete diffusion, our approach provides a unified training objective for text and audio. To support this hybrid generation paradigm, we design a modality-aware attention mechanism that enforces causal decoding for text while allowing bidirectional modeling within audio spans, and we further introduce three training strategies that reduce train-test discrepancies. During inference, TtT employs block-wise diffusion to synthesize audio in parallel while flexibly handling variable-length outputs. Comprehensive experiments on Audio-QA, ASR, AAC, and speech-to-speech benchmarks show that TtT consistently surpasses strong AR and NAR baselines, with additional ablation and training-strategy analyses confirming the contribution of each component. We will open-source our models, data, and code to facilitate future research in this direction.
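The modality-aware attention described above can be illustrated with a minimal sketch: text positions attend causally, while audio positions additionally attend bidirectionally within their own span (block). The function and its inputs below are illustrative assumptions, not the paper's actual implementation; a real model would fold this mask into the Transformer's attention logits.

```python
import numpy as np

def modality_aware_mask(modalities, block_ids):
    """Build a boolean attention mask M where M[i, j] = True means
    position i may attend to position j.

    modalities: per-position tag, "text" or "audio" (illustrative encoding)
    block_ids:  audio span index per position (None for text positions)
    """
    n = len(modalities)
    # Causal baseline: every position sees itself and all earlier positions.
    mask = np.tril(np.ones((n, n), dtype=bool))
    for i in range(n):
        if modalities[i] != "audio":
            continue
        for j in range(n):
            # Audio positions also see *future* positions in the same span,
            # enabling bidirectional modeling within each audio block.
            if modalities[j] == "audio" and block_ids[j] == block_ids[i]:
                mask[i, j] = True
    return mask

# Example sequence: two text tokens, one audio span of three tokens, one text token.
mods = ["text", "text", "audio", "audio", "audio", "text"]
blocks = [None, None, 0, 0, 0, None]
M = modality_aware_mask(mods, blocks)
```

Here an audio token at position 2 can attend forward to position 4 (same span), while text remains strictly causal; the final text token still sees the whole preceding audio span causally.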