Unified multimodal pretraining has emerged as a promising paradigm for jointly modeling language and vision within a single foundation model. However, existing approaches largely rely on implicit or indirect alignment signals and remain suboptimal for simultaneously supporting multimodal understanding and generation, particularly in settings that require fine-grained language-visual reasoning and controllable generation. In this work, we propose LVRPO, a language-visual reinforcement-based preference optimization framework that explicitly aligns language and visual representations using Group Relative Policy Optimization (GRPO). Instead of introducing additional alignment losses at the representation level, LVRPO directly optimizes multimodal model behaviors through preference-driven reinforcement signals, encouraging consistent and semantically grounded interactions between language and vision across both understanding and generation tasks. This formulation enables effective alignment without requiring auxiliary encoders or handcrafted cross-modal objectives, and naturally extends to diverse multimodal capabilities. Empirically, LVRPO consistently outperforms strong unified-pretraining baselines on a broad suite of benchmarks spanning multimodal understanding, generation, and reasoning.
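The abstract names GRPO as the optimization method but gives no detail on how LVRPO applies it. As a rough illustration of the group-relative idea only, the sketch below normalizes the rewards of several responses sampled for the same prompt against that group's mean and standard deviation. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are all assumptions made for this example; none of them comes from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Hypothetical helper: each advantage is the reward's deviation from the
    group mean, scaled by the group's population standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four responses sampled for a single prompt
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and those below receive negative ones, so no separately learned value function is needed as a baseline. How LVRPO defines its language-visual rewards is not stated in the abstract and is not modeled here.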