Recent advancements in Vision-Language Models (VLMs) have enabled significant progress in complex video understanding tasks. However, their robustness to real-world manipulations remains underexplored, limiting their reliability in critical applications. To address this gap, we introduce MVTamperBench, a comprehensive benchmark designed to evaluate VLMs' resilience to video tampering effects, including rotation, dropping, masking, substitution, and repetition. By systematically assessing state-of-the-art models, MVTamperBench reveals substantial variability in robustness: models such as InternVL2-8B achieve high performance, while others, such as Llama-VILA1.5-8B, exhibit severe vulnerabilities. To foster broader adoption and reproducibility, MVTamperBench is integrated into VLMEvalKit, a modular evaluation toolkit, enabling streamlined testing and facilitating advancements in model robustness. Our benchmark represents a critical step towards developing tamper-resilient VLMs, ensuring their dependability in real-world scenarios. Project Page: https://amitbcp.github.io/MVTamperBench/