Multi-modal Large Language Models (MLLMs) capable of video understanding are advancing rapidly. To effectively assess their video comprehension capabilities, long video understanding benchmarks, such as Video-MME and MLVU, have been proposed. However, these benchmarks rely directly on uniform frame sampling for testing, which causes significant information loss and prevents the evaluations from accurately reflecting the true abilities of MLLMs. To address this, we propose RAG-Adapter, a plug-and-play framework that reduces information loss during testing by sampling the frames most relevant to the given question. Additionally, we introduce a Grouped-supervised Contrastive Learning (GCL) method to further enhance the sampling effectiveness of RAG-Adapter through fine-tuning on our constructed MMAT dataset. Finally, we test numerous baseline MLLMs on various video understanding benchmarks, finding that RAG-Adapter sampling consistently outperforms uniform sampling (e.g., the accuracy of GPT-4o increases by 9.3% on Video-MME), providing a more accurate testing method for long video benchmarks.
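To make the question-conditioned sampling idea concrete, below is a minimal sketch of retrieving the top-k frames most relevant to a question. It assumes CLIP as the retriever and a function name (`select_relevant_frames`) and default k chosen for illustration; the actual RAG-Adapter retriever, its GCL fine-tuning, and its frame-decoding pipeline are not specified here and may differ.

```python
# Hypothetical sketch of question-conditioned frame retrieval in the spirit
# of RAG-Adapter, using off-the-shelf CLIP embeddings (an assumption; the
# paper's retriever is fine-tuned with GCL on MMAT).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor


def select_relevant_frames(frames: list[Image.Image], question: str, k: int = 16) -> list[int]:
    """Return indices of the k candidate frames most similar to the question."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    with torch.no_grad():
        # Embed the question text and every candidate frame.
        text_inputs = processor(text=[question], return_tensors="pt",
                                padding=True, truncation=True)
        q_emb = model.get_text_features(**text_inputs)      # (1, d)
        image_inputs = processor(images=frames, return_tensors="pt")
        f_emb = model.get_image_features(**image_inputs)    # (n, d)

    # Cosine similarity between the question and each frame.
    q_emb = q_emb / q_emb.norm(dim=-1, keepdim=True)
    f_emb = f_emb / f_emb.norm(dim=-1, keepdim=True)
    sims = (f_emb @ q_emb.T).squeeze(-1)                    # (n,)

    # Keep the k most relevant frames, restoring temporal order.
    topk = torch.topk(sims, k=min(k, len(frames))).indices
    return sorted(topk.tolist())
```

In contrast to uniform sampling, which spaces k frames evenly regardless of content, this retrieval step concentrates the frame budget on the segments that actually bear on the question, which is the source of the accuracy gains reported above.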