Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making it difficult to balance them during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GRPO, a pair-aware variant of Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate PairUG, a high-quality dataset of 16K UG pairs, for RL fine-tuning, and evaluate PairUni on the strong Janus-Pro family of UVLMs. Our approach achieves balanced improvements across tasks, outperforming strong UVLM RL baselines. Code: \href{https://github.com/Haochen-Wang409/PairUni}{github.com/Haochen-Wang409/PairUni}
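A minimal sketch of the pair-aware advantage modulation described above, not the authors' implementation: the exact modulation rule is not specified here, so we assume the UG-pair similarity score simply scales the standardized group-relative advantage. The function names and the similarity value are hypothetical.

\begin{verbatim}
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within a rollout group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def pair_grpo_advantages(rewards, pair_similarity):
    """Assumed Pair-GRPO variant: scale each rollout's advantage by the
    UG-pair similarity so well-aligned pairs contribute more to the update."""
    return pair_similarity * group_relative_advantages(rewards)

# Example: one UG pair with a group of four rollouts and an assumed
# similarity score in [0, 1].
rewards = [1.0, 0.0, 0.5, 1.0]
similarity = 0.8
print(pair_grpo_advantages(rewards, similarity))
\end{verbatim}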