Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.