Preference alignment has become a crucial component in enhancing the performance of Large Language Models (LLMs), yet its impact in Multimodal Large Language Models (MLLMs) remains comparatively underexplored. Like language models, MLLMs for image understanding tasks encounter challenges such as hallucination. In MLLMs, hallucination can arise not only from stating incorrect facts but also from producing responses that are inconsistent with the image content. A primary objective of alignment for MLLMs is to encourage these models to align their responses more closely with the image information. Recently, multiple works have introduced preference datasets for MLLMs and examined different alignment methods, including Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO). However, due to variations in datasets, base model types, and alignment methods, it remains unclear which specific elements contribute most significantly to the reported improvements in these works. In this paper, we independently analyze each aspect of preference alignment in MLLMs. We start by categorizing the alignment algorithms into two groups, offline (such as DPO) and online (such as online-DPO), and show that combining offline and online methods can improve model performance in certain scenarios. We review a variety of published multimodal preference datasets and discuss how the details of their construction impact model performance. Based on these insights, we introduce a novel way of creating multimodal preference data called Bias-Driven Hallucination Sampling (BDHS) that needs neither additional annotation nor external models, and show that it achieves performance competitive with previously published alignment work for multimodal models across a range of benchmarks.
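As a concrete reference for the offline/online distinction above, the sketch below shows the standard DPO objective on a batch of preference pairs. This is a minimal illustration of the well-known loss, not an excerpt of our training code; the tensor names and the default beta are illustrative assumptions.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss on a batch of preference pairs.

    Each argument is a tensor of shape (batch,) holding the summed
    token log-probabilities log pi(y | x, image) of the chosen (y_w)
    and rejected (y_l) responses, under the trainable policy and the
    frozen reference model respectively. beta scales the implicit
    KL-style penalty toward the reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigma(beta * margin); logsigmoid is the numerically stable form.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

In the offline setting, the (y_w, y_l) pairs come from a fixed preference dataset; in the online variant, responses are re-sampled from the current policy during training and labeled on the fly before the same loss is applied.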