Despite significant progress on multi-modal tasks, current Multi-modal Large Language Models (MLLMs) still face the major challenge of hallucination, which may lead to harmful consequences. Evaluating MLLMs' hallucinations is therefore becoming increasingly important for model improvement and practical deployment. Previous works are limited by high evaluation costs (e.g., relying on humans or advanced LLMs) and insufficient evaluation dimensions (e.g., types of hallucination and task). In this paper, we propose AMBER, an LLM-free multi-dimensional benchmark that can be used to evaluate both generative and discriminative tasks, including object existence, object attribute, and object relation hallucination. Based on AMBER, we design a low-cost and efficient evaluation pipeline. Additionally, we conduct a comprehensive evaluation and detailed analysis of mainstream MLLMs, including GPT-4V(ision), and offer guidelines for mitigating hallucinations. The data and code of AMBER are available at https://github.com/junyangwang0410/AMBER.