Evaluating Video Language Models (VLMs) is a challenging task. Owing to its transparency, Multiple-Choice Question Answering (MCQA) is widely used to measure the performance of these models through accuracy. However, existing MCQA benchmarks fail to capture the full reasoning capabilities of VLMs because of selection bias, i.e., the tendency of models to disproportionately favor certain answer options based on positional patterns observed during training. In this work, we conduct a comprehensive empirical analysis of several VLM architectures across major datasets designed to assess complex video-focused reasoning. We identify where this bias is most pronounced and demonstrate to what extent model responses reflect genuine understanding of video content and the accompanying questions, as opposed to reliance on arbitrary patterns or superficial cues such as answer position. By decomposing the MCQA task and adapting fairness bias metrics to VLMs, we introduce BOLD, a post-processing calibration technique that balances this bias. Our results show that reducing selection bias improves not only debiasing metrics but also overall model performance, including accuracy and F1 mean score. By suppressing "blind guessing", our method offers a more cost- and time-effective approach to mitigating selection bias than existing techniques. This study is the first focused investigation of selection bias in video-to-text, LLM-powered models.
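The abstract does not describe BOLD's internals, so the following is only a minimal, hedged sketch of what post-processing calibration against a positional prior can look like in general, not the paper's method: estimate how often each answer slot is chosen on held-out data, then subtract the log of that prior from the per-option scores before re-selecting. The names `positional_prior` and `calibrate` are hypothetical.

```python
import numpy as np

# Illustrative positional-prior calibration for MCQA (assumed approach,
# NOT the paper's BOLD implementation). Assumes per-option scores
# (e.g., log-probabilities) are already available from a VLM.

def positional_prior(chosen_positions: np.ndarray, num_options: int) -> np.ndarray:
    """Estimate how often each answer slot (A, B, C, ...) is selected
    across a held-out set, irrespective of content -- a crude proxy
    for positional selection bias."""
    counts = np.bincount(chosen_positions, minlength=num_options).astype(float)
    return counts / counts.sum()

def calibrate(option_logprobs: np.ndarray, prior: np.ndarray, eps: float = 1e-9) -> int:
    """Debias option scores by subtracting the log positional prior,
    then re-select the highest-scoring option."""
    return int(np.argmax(option_logprobs - np.log(prior + eps)))

# Toy usage: a model that over-selects slot 0 on held-out questions.
positions = np.array([0, 0, 0, 1, 0, 2, 0, 3])      # raw choices on held-out data
prior = positional_prior(positions, num_options=4)   # -> [0.625, 0.125, 0.125, 0.125]
scores = np.array([-1.0, -1.1, -2.5, -3.0])          # per-option log-probs, one question
print(calibrate(scores, prior))                      # slot 1 wins once the bias is removed
```

Without calibration, the argmax of `scores` would be the over-favored slot 0; dividing out the positional prior flips the choice to the content-driven runner-up.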