Multimodal large language models (MLLMs) have achieved remarkable progress on various visual question answering and reasoning tasks by leveraging instruction fine-tuning on task-specific datasets. They can also learn from human-annotated preference data to enhance their reasoning ability and mitigate hallucinations, and most preference data is generated by the model itself. However, existing methods require high-quality critical labels, which are costly to obtain and rely on humans or proprietary models such as GPT-4V. In this work, we propose Enhancing Alignment in MLLMs via Critical Observation (EACO), which aligns MLLMs economically through self-generated preference data using only 5k images. Our approach begins with collecting and refining a scoring-evaluation instruction-tuning dataset to train a critical evaluation model, termed the Critic. The Critic observes model responses across multiple dimensions, selecting preferred and non-preferred outputs for refined Direct Preference Optimization (DPO) tuning. To further enhance model performance, we employ an additional supervised fine-tuning stage after preference tuning. EACO reduces overall hallucination by 65.6% on HallusionBench and improves reasoning ability by 21.8% on MME-Cognition, and it achieves an 8.5% improvement over LLaVA-v1.6-Mistral-7B across multiple benchmarks. Remarkably, EACO also reveals latent critical ability in open-source MLLMs, demonstrating that EACO is a viable path to boosting the competence of MLLMs.
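The preference-tuning step above relies on the standard DPO objective over the Critic-selected pairs. As a minimal sketch (the paper's "refined" variant may differ; all function and variable names here are illustrative, not from the original work), the loss compares the policy's preference margin against a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over preference pairs.

    Each tensor holds per-example sequence log-probabilities of the
    preferred ("chosen") or non-preferred ("rejected") response under
    either the policy being tuned or the frozen reference model.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # -log sigmoid(beta * (policy margin - reference margin)), averaged
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy values: the policy prefers the chosen response more strongly
# than the reference does, so the loss falls below log(2) ≈ 0.693.
loss = dpo_loss(torch.tensor([-4.0]), torch.tensor([-9.0]),
                torch.tensor([-5.0]), torch.tensor([-8.0]))
```

In EACO, the chosen/rejected pairs would come from the Critic's multi-dimensional scoring of the model's own responses rather than human annotation.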