Large multimodal models (LMMs) excel at following human instructions. However, as multimodal interaction grows and context lengths increase, self-contradictory instructions can arise, posing particular challenges for language beginners and vulnerable populations. We introduce the Self-Contradictory Instructions benchmark to evaluate the capability of LMMs to recognize conflicting commands. It comprises 20,000 conflicts, evenly distributed between language and vision paradigms, and is constructed with a novel automatic dataset creation framework that expedites the process and enables us to cover a wide range of instruction forms. Our comprehensive evaluation reveals that current LMMs consistently struggle to identify multimodal instruction discordance due to a lack of self-awareness. We therefore propose Cognitive Awakening Prompting, which injects cognition from an external source and substantially improves dissonance detection. The dataset and code are available at https://selfcontradiction.github.io/.