Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face inherent trade-offs between maintaining strong semantic comprehension and acquiring powerful generation capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that democratizes these capabilities within a unified framework. Guided by the principles of unified contextual modeling and modality-specific modular design with decoupled visual representations, InternVL-U integrates a state-of-the-art Multimodal Large Language Model (MLLM) with a specialized MMDiT-based visual generation head. To further bridge the gap between aesthetic generation and high-level intelligence, we construct a comprehensive data synthesis pipeline targeting high-semantic-density tasks, such as text rendering and scientific reasoning, under a reasoning-centric paradigm that leverages Chain-of-Thought (CoT) to better align abstract user intent with fine-grained visual generation details. Extensive experiments demonstrate that InternVL-U achieves a superior performance-efficiency balance. Despite using only 4B parameters, it consistently outperforms unified baselines at over 3x its scale, such as BAGEL (14B), on a range of generation and editing tasks, while retaining strong multimodal understanding and reasoning capabilities.
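To make the modular design concrete, the following is a minimal, hypothetical PyTorch sketch of the decoupled wiring the abstract describes: an MLLM backbone produces semantic condition tokens, and a separate MMDiT-style head denoises image latents conditioned on them, so understanding and generation use distinct visual representations. All class, function, and parameter names here are illustrative assumptions, not InternVL-U's actual implementation; timestep conditioning, VAE encoding/decoding, and other standard diffusion machinery are omitted for brevity.

```python
import torch
import torch.nn as nn

class MMDiTBlock(nn.Module):
    """Simplified MMDiT-style block: image latents and condition tokens keep
    modality-specific normalizations but attend over a joint sequence."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.img_norm = nn.LayerNorm(dim)
        self.txt_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        # Joint attention over concatenated image and condition tokens.
        x = torch.cat([self.img_norm(img), self.txt_norm(txt)], dim=1)
        attn_out, _ = self.attn(x, x, x)
        x = torch.cat([img, txt], dim=1) + attn_out
        x = x + self.mlp(x)
        n_img = img.shape[1]
        return x[:, :n_img], x[:, n_img:]

class UnifiedModelSketch(nn.Module):
    """Hypothetical wiring (not InternVL-U's API): an MLLM stand-in provides
    semantic condition tokens; an MMDiT head denoises image latents on top of
    them, keeping generation representations decoupled from understanding."""
    def __init__(self, llm_dim: int = 512, gen_dim: int = 256, depth: int = 2):
        super().__init__()
        # Stand-in for the MLLM backbone (understanding/reasoning stream).
        self.mllm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Bridge from MLLM hidden states into the generation head's space.
        self.cond_proj = nn.Linear(llm_dim, gen_dim)
        self.blocks = nn.ModuleList([MMDiTBlock(gen_dim) for _ in range(depth)])
        self.out = nn.Linear(gen_dim, gen_dim)

    def forward(self, text_emb: torch.Tensor, noisy_latents: torch.Tensor):
        h = self.mllm(text_emb)       # MLLM hidden states (semantic stream)
        cond = self.cond_proj(h)      # condition tokens for the generation head
        img = noisy_latents
        for blk in self.blocks:
            img, cond = blk(img, cond)
        return self.out(img)          # predicted noise / velocity target

if __name__ == "__main__":
    model = UnifiedModelSketch()
    text = torch.randn(2, 16, 512)     # prompt token embeddings (toy values)
    latents = torch.randn(2, 64, 256)  # flattened noisy image latents
    print(model(text, latents).shape)  # torch.Size([2, 64, 256])
```

The design point this sketch illustrates is that only a projected condition stream crosses from the MLLM into the generation head, so the backbone's comprehension-oriented representations are not overwritten by generation training.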