Multimodal Large Language Model (MLLM) classification performance depends critically on the evaluation protocol and the quality of the ground truth. Studies comparing MLLMs with supervised and vision-language models report conflicting conclusions, and we show that these conflicts stem from protocols that either inflate or underestimate performance. Across the most common evaluation protocols, we identify and fix key issues: model outputs that fall outside the provided class list and are discarded, inflated results from weak multiple-choice distractors, and an open-world setting that underperforms only because of poor output mapping. We additionally quantify the impact of commonly overlooked design choices - batch size, image ordering, and text encoder selection - showing that they substantially affect accuracy. Evaluating on ReGT, our multilabel reannotation of 625 ImageNet-1k classes, reveals that MLLMs benefit most from corrected labels (up to +10.8%), substantially narrowing the perceived gap with supervised models. Much of the reported MLLM underperformance on classification is thus an artifact of noisy ground truth and flawed evaluation protocols rather than genuine model deficiency. Models that rely less on supervised training signals prove most sensitive to annotation quality. Finally, we show that MLLMs can assist human annotators: in a controlled case study, annotators confirmed or integrated MLLM predictions in approximately 50% of difficult cases, demonstrating their potential for large-scale dataset curation.