This paper makes the first attempt at unsupervised preference alignment in Vision-Language Models (VLMs). We generate chosen and rejected responses from pairs of original and augmented images, and perform preference alignment with direct preference optimization. The core idea is that a properly designed augmentation of the image input induces the VLM to generate false but hard negative responses, which the model can learn from to produce more robust and accurate answers. The whole pipeline no longer hinges on supervision from GPT-4 or human involvement during alignment, and is highly efficient, requiring only a few lines of code. With only 8k randomly sampled unsupervised data, it achieves a 90\% relative score to GPT-4 on complex reasoning in LLaVA-Bench, and improves LLaVA-7B/13B by 6.7\%/5.6\% on the complex multi-modal benchmark MM-Vet. Visualizations show its improved ability to align with user intentions. A series of ablations is conducted to reveal the latent mechanism of the approach, which also indicates its potential for further scaling. Code will be available.