Vision-language alignment in multi-modal large language models (MLLMs) typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). SFT is stable and efficient but requires large-scale human annotations and cannot capture subtle preferences, while RL introduces a reward signal yet suffers from training overhead and instability. These limitations expose a trade-off among scalability, robustness, and alignment quality. To address this, we propose MergeMix, a training-time augmentation paradigm that bridges SFT and RL. It first applies attention-aware image mixing via token merging, which preserves richer cluster representations and spatial context, and then introduces a preference-driven training paradigm for MLLMs by building preference pairs from mixed and raw images and optimizing with the SimPO loss. As a mixup augmentation, MergeMix enhances attention consistency and efficiency, surpassing other heuristic-based methods in classification. Extensive experiments demonstrate that MergeMix achieves competitive accuracy with improved efficiency, providing a scalable approach to preference alignment in both classification and MLLMs.
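For intuition on the preference-optimization step, below is a minimal sketch of a SimPO-style loss in PyTorch. It assumes the chosen and rejected responses come from the preference pairs built over raw and mixed images; the variable names, hyperparameter values, and batching are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps, chosen_lens, rejected_lens,
               beta=2.0, gamma=1.0):
    """SimPO: length-normalized, reference-free preference loss.

    chosen_logps / rejected_logps: summed token log-probs of each response
    chosen_lens / rejected_lens:   response lengths in tokens
    beta, gamma: reward scale and target margin (values here are placeholders)
    """
    # Average log-likelihood of each response acts as the implicit reward.
    chosen_reward = beta * chosen_logps / chosen_lens
    rejected_reward = beta * rejected_logps / rejected_lens
    # Bradley-Terry style objective with a margin; no reference model needed.
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()
```

In this setup, the preferred response in each pair (e.g., the one grounded in the raw image versus the mixed image, per the paper's pairing scheme) is pushed to have a higher length-normalized likelihood than the rejected one by at least the margin gamma.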