Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making it difficult to balance them during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. In addition, for each generation sample we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To exploit this structure, we present Pair-GRPO, a pair-aware variant of Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs, named PairUG, for RL fine-tuning and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements across various UVLMs and outperforms strong UVLM RL baselines. Code is available at https://github.com/Haochen-Wang409/PairUni.
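The abstract states that Pair-GRPO modulates the group-relative advantage with a per-pair similarity score. The sketch below illustrates one plausible form of that modulation, assuming a standard GRPO-style group normalization and a simple multiplicative scaling; the function name, tensor shapes, and scaling rule are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def pair_grpo_advantages(rewards: torch.Tensor, pair_similarity: float, eps: float = 1e-6) -> torch.Tensor:
    """Minimal sketch of a pair-aware GRPO-style advantage (hypothetical).

    rewards:          (G,) rewards for G rollouts generated from one UG pair
    pair_similarity:  scalar in [0, 1] measuring how well the understanding
                      and generation samples of the pair align
    """
    # Standard GRPO step: normalize rewards within the rollout group.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Pair-aware modulation (assumed form): scale the advantage by the
    # similarity score, so well-aligned pairs drive larger policy updates
    # while poorly aligned pairs contribute less, reducing task interference.
    return pair_similarity * adv
```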