We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models' compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results from a wide array of state-of-the-art MLLMs reveal significant variations in performance, highlighting areas for improvement in instruction fidelity. Additionally, we create extra training data and explore supervised fine-tuning to enhance the models' ability to strictly follow instructions without compromising performance on other tasks. We hope this benchmark not only serves as a tool for measuring MLLM adherence to instructions, but also guides future developments in MLLM training methods.