Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making it difficult to balance them during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GRPO, a pair-aware variant of Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate PairUG, a high-quality dataset of 16K UG pairs, for RL fine-tuning, and evaluate PairUni on the strong Janus-Pro family of UVLMs. Our approach achieves balanced improvements across tasks, outperforming strong UVLM RL baselines. Code: \href{https://github.com/Haochen-Wang409/PairUni}{github.com/Haochen-Wang409/PairUni}
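A minimal sketch of the pair-aware advantage modulation described above, not the authors' implementation: the exact modulation rule is not specified here, so we assume the UG-pair similarity score simply scales the standardized group-relative advantage. The function names and the similarity value are hypothetical.

\begin{verbatim}
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within a rollout group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def pair_grpo_advantages(rewards, pair_similarity):
    """Assumed Pair-GRPO variant: scale each rollout's advantage by the
    UG-pair similarity so well-aligned pairs contribute more to the update."""
    return pair_similarity * group_relative_advantages(rewards)

# Example: one UG pair with a group of four rollouts and an assumed
# similarity score in [0, 1].
rewards = [1.0, 0.0, 0.5, 1.0]
similarity = 0.8
print(pair_grpo_advantages(rewards, similarity))
\end{verbatim}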