Multimodal Large Language Models (MLLMs) have driven major advances in video understanding, yet their vulnerability to adversarial tampering and manipulation remains underexplored. To address this gap, we introduce MVTamperBench, a benchmark that systematically evaluates MLLM robustness against five prevalent tampering techniques: rotation, masking, substitution, repetition, and dropping. Built from 3.4K original videos, expanded to over 17K tampered clips spanning 19 video tasks, MVTamperBench challenges models to detect manipulations that disrupt spatial and temporal coherence. We evaluate 45 recent MLLMs from 15+ model families, revealing substantial variability in resilience across tampering types and showing that larger parameter counts do not necessarily guarantee robustness. MVTamperBench sets a new benchmark for developing tamper-resilient MLLMs in safety-critical applications, including detecting clickbait, preventing harmful content distribution, and enforcing policies on media platforms. We release all code and data to foster open research in trustworthy video understanding. Code: https://amitbcp.github.io/MVTamperBench/ Data: https://huggingface.co/datasets/Srikant86/MVTamperBench
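To make the five tampering types concrete, the sketch below applies each of them to a segment of a frame array. This is a minimal illustration of the general idea, not the benchmark's actual pipeline; the function name, segment semantics, and per-type choices (e.g. 180° rotation, zero-masking, substituting the first frame) are assumptions for demonstration only.

```python
import numpy as np

def tamper(frames: np.ndarray, kind: str, start: int, length: int) -> np.ndarray:
    """Apply one of five illustrative tampering types to a (T, H, W, C) frame array.

    NOTE: hypothetical sketch; details differ from the MVTamperBench implementation.
    """
    out = frames.copy()
    seg = slice(start, start + length)
    if kind == "rotation":
        # rotate every frame in the segment by 180 degrees
        out[seg] = np.rot90(out[seg], k=2, axes=(1, 2))
    elif kind == "masking":
        # black out the segment
        out[seg] = 0
    elif kind == "substitution":
        # replace the segment with unrelated content (here: the first frame)
        out[seg] = out[0].copy()
    elif kind == "repetition":
        # freeze: repeat a single frame across the segment
        out[seg] = out[start].copy()
    elif kind == "dropping":
        # remove the segment entirely, shortening the clip
        out = np.delete(out, np.arange(start, start + length), axis=0)
    else:
        raise ValueError(f"unknown tampering type: {kind}")
    return out

# Toy example: 10 frames of 4x4 RGB video.
video = np.random.rand(10, 4, 4, 3)
dropped = tamper(video, "dropping", 2, 3)   # clip shrinks from 10 to 7 frames
masked = tamper(video, "masking", 2, 3)     # frames 2..4 become all zeros
```

Each operation leaves most of the clip intact but breaks spatial coherence (rotation, masking) or temporal coherence (substitution, repetition, dropping), which is exactly what the benchmark asks models to detect.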