Real-world perception and interaction are inherently multimodal, encompassing not only language but also vision and speech, which motivates the development of "Omni" MLLMs that support both multimodal inputs and multimodal outputs. While a number of omni MLLMs have emerged, most existing systems still rely on additional expert components to achieve multimodal generation, limiting the simplicity of unified training and inference. Autoregressive (AR) modeling, with a single token stream, a single next-token objective, and a single decoder, is an elegant and scalable foundation in the text domain. Motivated by this, we present AR-Omni, a unified any-to-any model in the autoregressive paradigm without any expert decoders. AR-Omni supports autoregressive text and image generation, as well as streaming speech generation, all under a single Transformer decoder. We further address three practical issues in unified AR modeling: modality imbalance, via task-aware loss reweighting; visual fidelity, via a lightweight token-level perceptual alignment loss on image tokens; and the stability-creativity trade-off, via a finite-state decoding mechanism. Empirically, AR-Omni attains strong quality across all three modalities while remaining real-time, with a real-time factor of 0.88 for speech generation.
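To make the finite-state decoding idea concrete, the following is a minimal sketch, not AR-Omni's actual mechanism: the abstract does not specify the states or token-type partition, so the states, token types, and transitions below are illustrative assumptions. A small state machine restricts which token types may be sampled next (e.g., once an image segment opens, only image tokens or the closing delimiter are legal), which is one way to trade creativity for stability in a single-stream AR decoder.

```python
# Hypothetical finite-state decoding constraint (names and states assumed,
# not taken from the paper). Token types partition the vocabulary.
TEXT, IMG_START, IMG_TOK, IMG_END, SPEECH_TOK = (
    "text", "<img>", "img", "</img>", "speech"
)

# state -> set of token types that may be emitted next; logits of all other
# types would be masked to -inf before sampling.
TRANSITIONS = {
    "free":  {TEXT, IMG_START, SPEECH_TOK},  # unconstrained generation
    "image": {IMG_TOK, IMG_END},             # inside an image segment
}

def allowed(state):
    """Token types whose logits are kept in the current state."""
    return TRANSITIONS[state]

def next_state(state, token_type):
    """Advance the FSM; reject token types that are illegal in this state."""
    if token_type not in TRANSITIONS[state]:
        raise ValueError(f"{token_type!r} not allowed in state {state!r}")
    if token_type == IMG_START:
        return "image"
    if token_type == IMG_END:
        return "free"
    return state

# Walk a legal mixed-modality token sequence through the FSM.
state = "free"
for t in [TEXT, IMG_START, IMG_TOK, IMG_TOK, IMG_END, SPEECH_TOK]:
    state = next_state(state, t)
print(state)  # ends back in the unconstrained state: "free"
```

In a real decoder, `allowed(state)` would drive a logit mask at each step, so the constraint costs only a lookup per token and composes with any sampling strategy.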