Recent advances in large language models have laid the foundation for multimodal LLMs (MLLMs), which unify text, speech, and vision within a single framework. As these models rapidly evolve toward general-purpose instruction following across diverse and complex tasks, a key frontier is evaluating their crosslingual and multimodal capabilities over both short- and long-form inputs. However, existing benchmarks fall short of evaluating these dimensions jointly: they are often limited to English, mostly focus on a single modality at a time, rely on short-form inputs, or lack human annotations, hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first crosslingual, human-annotated benchmark based on scientific talks on NLP and beyond. MCIF evaluates instruction following in crosslingual, multimodal settings over different input lengths and spans four macro-tasks: recognition, translation, question answering, and summarization. It covers three core modalities (speech, vision, and text) and four diverse languages (English, German, Italian, and Chinese), fully aligned across all dimensions. This parallel design enables a systematic evaluation of MLLMs' abilities to interpret instructions across languages and effectively integrate multimodal contextual information. Our benchmarking and analysis of 23 models highlight universal challenges across modalities and tasks, indicating substantial room for improvement in future MLLM development. MCIF is released under the CC-BY 4.0 license to promote open research.