Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning

Following the success of Large Language Models (LLMs), Large Multimodal Models (LMMs), such as the Flamingo model and its subsequent competitors, have started to emerge as natural steps towards generalist agents. However, interacting with recent LMMs reveals major limitations that current evaluation benchmarks barely capture. Indeed, task performance alone (e.g., VQA accuracy) does not provide enough clues to understand these models' real capabilities and limitations, or to what extent they are aligned with human expectations. To refine our understanding of those flaws, we deviate from the current evaluation paradigm and propose the EvALign-ICL framework, in which we (1) evaluate 8 recent open-source LMMs (based on the Flamingo architecture, such as OpenFlamingo and IDEFICS) on 5 different axes: hallucinations, abstention, compositionality, explainability, and instruction following. Our evaluation on these axes reveals major flaws in LMMs. To efficiently address these problems, and inspired by the success of in-context learning (ICL) in LLMs, (2) we explore ICL as a solution and study how it affects these limitations. Based on our ICL study, (3) we push ICL further and propose new multimodal ICL approaches such as Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL. Our findings are as follows: (1) Despite their success, LMMs have flaws that remain unsolved with scaling alone. (2) The effect of ICL on LMMs' flaws is nuanced: although it is effective for improving explainability, abstention, and instruction following, ICL does not improve compositional abilities and even amplifies hallucinations. (3) The proposed ICL variants are promising as post-hoc approaches to efficiently tackle some of those flaws. The code is available here: https://evalign-icl.github.io/
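
To make the idea of multimodal in-context learning concrete, the following is a minimal sketch of how a few image-question-answer demonstrations might be interleaved with a query before being passed to a Flamingo-style model. It is an illustration under stated assumptions, not the EvALign-ICL implementation: the `Demo` class, the helper names, the prompt template, the file names, and the `<image>` and `<|endofchunk|>` placeholder strings are all hypothetical choices made for this example. One demonstration answers "I don't know" to hint at how demonstrations could encourage abstention.

```python
from dataclasses import dataclass
from typing import List

# Assumed placeholder tokens; the actual special tokens depend on the model.
IMAGE_TOKEN = "<image>"
END_OF_CHUNK = "<|endofchunk|>"


@dataclass
class Demo:
    """One in-context demonstration (hypothetical structure)."""
    image_path: str
    question: str
    answer: str


def build_icl_prompt(demos: List[Demo], query_question: str) -> str:
    """Interleave demonstrations with the query, leaving the last answer open."""
    parts = [
        f"{IMAGE_TOKEN}Question: {d.question} Answer: {d.answer}{END_OF_CHUNK}"
        for d in demos
    ]
    # The query has an image slot and a question, but no answer yet.
    parts.append(f"{IMAGE_TOKEN}Question: {query_question} Answer:")
    return "".join(parts)


def image_order(demos: List[Demo], query_image_path: str) -> List[str]:
    """Return image paths in the same order as the image slots in the prompt."""
    return [d.image_path for d in demos] + [query_image_path]


# Hypothetical demonstrations; the second one models abstention.
demos = [
    Demo("cat.jpg", "What animal is shown?", "A cat"),
    Demo("street.jpg", "What is the dog's name?", "I don't know"),
]
prompt = build_icl_prompt(demos, "What color is the car?")
images = image_order(demos, "car.jpg")
print(prompt)
```

The text prompt and the ordered image list would then be given to the model's own preprocessing and generation routines, which are deliberately left out here because they differ between models.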
