Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning

Following the success of Large Language Models (LLMs), Large Multimodal Models (LMMs), such as the Flamingo model and its subsequent competitors, have started to emerge as natural steps towards generalist agents. However, interacting with recent LMMs reveals major limitations that are hardly captured by current evaluation benchmarks. Indeed, task performance (e.g., VQA accuracy) alone does not provide enough clues to understand their real capabilities and limitations, or to what extent such models are aligned with human expectations. To refine our understanding of those flaws, we deviate from the current evaluation paradigm and (1) evaluate 10 recent open-source LMMs, from 3B up to 80B parameters, on 5 different axes: hallucinations, abstention, compositionality, explainability, and instruction following. Our evaluation on these axes reveals major flaws in LMMs. While the current go-to solution for aligning these models is based on training, such as instruction tuning or RLHF, we instead (2) explore training-free in-context learning (ICL) as a solution and study how it affects these limitations. Based on our ICL study, (3) we push ICL further and propose new multimodal ICL variants such as Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL. Our findings are as follows. (1) Despite their success, LMMs have flaws that remain unsolved with scaling alone. (2) The effect of ICL on LMMs' flaws is nuanced: although it is effective at improving explainability and answer abstention, ICL only slightly improves instruction following, does not improve compositional abilities, and even amplifies hallucinations. (3) The proposed ICL variants are promising as post-hoc approaches to efficiently tackle some of those flaws. The code is available here: https://github.com/mshukor/EvALign-ICL.
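
To make the training-free setting concrete, the following is a minimal sketch of how a multimodal ICL prompt for answer abstention might be assembled: a few image-question-answer demonstrations, some of which answer an unanswerable question with an explicit refusal, are interleaved before the query. This is an illustration under stated assumptions, not the prompt format used in the paper or in the EvALign-ICL repository. The names `Demo` and `build_icl_prompt`, the `Question: ... Answer: ...` template, and the `<image>` and `<|endofchunk|>` markers are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List

# Assumed placeholders (not taken from the paper): a marker for where an
# image is fed to the model, and a separator that closes each demonstration.
IMAGE_TOKEN = "<image>"
END_OF_DEMO = "<|endofchunk|>"


@dataclass
class Demo:
    """One in-context demonstration: an image plus a question and its answer."""
    image_path: str
    question: str
    answer: str


def build_icl_prompt(demos: List[Demo], query_question: str) -> str:
    """Interleave the demonstrations and end with the unanswered query."""
    parts = [
        f"{IMAGE_TOKEN}Question: {d.question} Answer: {d.answer}{END_OF_DEMO}"
        for d in demos
    ]
    # The query follows the same template but leaves the answer open,
    # so the model completes it conditioned on the demonstrations.
    parts.append(f"{IMAGE_TOKEN}Question: {query_question} Answer:")
    return "".join(parts)


# Hypothetical demonstrations: the second one shows abstention on a
# question that cannot be answered from its image.
demos = [
    Demo("demo_1.jpg", "What color is the bus?", "red"),
    Demo("demo_2.jpg", "What is the name of the dog?", "I don't know"),
]
prompt = build_icl_prompt(demos, "What brand is the laptop?")

# Images are passed to the model in the same order as the image markers.
image_paths = [d.image_path for d in demos] + ["query.jpg"]
```

Because the demonstrations only change the prompt, no model weights are updated; the number and choice of demonstrations are the knobs such a setup exposes.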
