Evaluating the instruction-following (IF) capabilities of Multimodal Large Language Models (MLLMs) is essential for rigorously assessing how faithfully model outputs adhere to user-specified intentions. However, existing benchmarks for evaluating the instruction-following capability of MLLMs focus primarily on verbal instructions expressed in the textual modality. This limitation hinders a thorough analysis of instruction-following capabilities, as it overlooks the implicit constraints embedded in the semantically rich visual modality. To address this gap, we introduce VC-IFEval, a new benchmark accompanied by a systematically constructed dataset that evaluates MLLMs' instruction-following ability in multimodal settings. Our benchmark systematically incorporates vision-dependent constraints into instruction design, enabling a more rigorous and fine-grained assessment of how well MLLMs align their outputs with both visual inputs and textual instructions. Furthermore, fine-tuning MLLMs on our dataset yields substantial gains in visual instruction-following accuracy and adherence. Through extensive evaluation of representative MLLMs, we provide new insights into the strengths and limitations of current models.