Unified multimodal models (UMMs) aim to integrate understanding and generation within a single architecture. However, how to coordinate these two capabilities for more effective and efficient reasoning remains underexplored. Existing coordination approaches either couple the two capabilities during training without explicit inference-time coordination, or impose a fixed coordination pattern on all inputs. In this work, we show that multimodal tasks exhibit substantial coordination-path diversity: different inputs favor different coordination paths, which suggests that exploiting this diversity is key to improving performance. We propose UniPath, a framework for adaptively modeling and exploiting coordination-path diversity. Instead of enforcing a single coordination pattern, we represent task solving as the selection and execution of a path, ranging from direct answering to textual inference, visual-thought construction, and hypothesis-based exploration. We construct role-aligned trajectories to train a path-conditioned executor and introduce a lightweight planner mechanism to enable input-dependent path selection. Experiments show that leveraging coordination-path diversity improves performance over fixed coordination strategies while providing interpretable intermediate behaviors. The code is available at: https://github.com/AIFrontierLab/TorchUMM/tree/main/src/umm/post_training/unipath.
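The select-then-execute idea described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the UniPath code: the names `Path`, `Query`, `toy_planner`, `toy_executor`, and `solve` are hypothetical, the planner is a hand-written keyword rule standing in for the learned planner, and the executor only returns a list of made-up step labels rather than running a model. Only the four path types come from the abstract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Path(Enum):
    """The four coordination paths named in the abstract."""
    DIRECT_ANSWER = "direct_answer"
    TEXTUAL_INFERENCE = "textual_inference"
    VISUAL_THOUGHT = "visual_thought"
    HYPOTHESIS_EXPLORATION = "hypothesis_exploration"


@dataclass
class Query:
    """Hypothetical input container: a question plus an image flag."""
    question: str
    has_image: bool = False


def toy_planner(query: Query) -> Path:
    """Stand-in for the planner: keyword rules, NOT a learned model."""
    text = query.question.lower()
    if "what if" in text or "which of" in text:
        return Path.HYPOTHESIS_EXPLORATION
    if query.has_image and any(w in text for w in ("rotate", "draw", "fold")):
        return Path.VISUAL_THOUGHT
    if any(w in text for w in ("why", "how many", "explain")):
        return Path.TEXTUAL_INFERENCE
    return Path.DIRECT_ANSWER


def toy_executor(query: Query, path: Path) -> List[str]:
    """Stand-in for a path-conditioned executor.

    Returns illustrative step labels for the chosen path; a real
    executor would call the unified model at each step.
    """
    steps = {
        Path.DIRECT_ANSWER: ["answer"],
        Path.TEXTUAL_INFERENCE: ["reason_in_text", "answer"],
        Path.VISUAL_THOUGHT: ["generate_intermediate_image", "inspect_image", "answer"],
        Path.HYPOTHESIS_EXPLORATION: ["propose_hypotheses", "verify_each", "answer"],
    }
    return steps[path]


def solve(query: Query) -> Tuple[Path, List[str]]:
    """Select a path for this input, then execute it."""
    path = toy_planner(query)
    return path, toy_executor(query, path)


if __name__ == "__main__":
    for q in (
        Query("What color is the car?", has_image=True),
        Query("Why does the left beam bend more?", has_image=True),
        Query("What does the cube look like if we rotate it?", has_image=True),
    ):
        path, trace = solve(q)
        print(q.question, "->", path.value, trace)
```

The point of the sketch is only the control flow: the path is chosen per input rather than fixed for all inputs, and the executor's behavior depends on the chosen path, which is what makes the intermediate steps inspectable.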