Video Large Language Models (VideoLLMs) have recently achieved strong performance in video understanding tasks. However, we identify a previously underexplored generation failure: severe output repetition, where models degenerate into self-reinforcing loops of repeated phrases or sentences. This failure mode is not captured by existing VideoLLM benchmarks, which focus primarily on task accuracy and factual correctness. We introduce VideoSTF, the first framework for systematically measuring and stress-testing output repetition in VideoLLMs. VideoSTF formalizes repetition using three complementary n-gram-based metrics and provides a standardized testbed of 10,000 diverse videos together with a library of controlled temporal transformations. Using VideoSTF, we conduct pervasive testing, temporal stress testing, and adversarial exploitation across 10 advanced VideoLLMs. We find that output repetition is widespread and, critically, highly sensitive to temporal perturbations of video inputs. Moreover, we show that simple temporal transformations can efficiently induce repetitive degeneration in a black-box setting, exposing output repetition as an exploitable security vulnerability. Our results reveal output repetition as a fundamental stability issue in modern VideoLLMs and motivate stability-aware evaluation for video-language systems. Our evaluation code and scripts are available at: https://github.com/yuxincao22/VideoSTF_benchmark.
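To make the notion of an n-gram-based repetition metric concrete, the following minimal sketch scores a generation by the fraction of duplicated word-level n-grams. The function name, whitespace tokenization, and choice of n are illustrative assumptions for exposition only; they are not the three metrics defined in VideoSTF.

```python
from collections import Counter


def ngram_repetition_rate(text: str, n: int = 4) -> float:
    """Fraction of n-grams that repeat an earlier n-gram in the text.

    0.0 means every n-gram is unique; values near 1.0 indicate the
    output has degenerated into a self-reinforcing loop.
    """
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    # Each n-gram type contributes (count - 1) duplicate occurrences.
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(ngrams)


if __name__ == "__main__":
    degenerate = "the cat sits on the mat " * 20
    normal = "a person opens the door, walks inside, and greets a friend"
    print(f"degenerate caption: {ngram_repetition_rate(degenerate):.2f}")
    print(f"normal caption:     {ngram_repetition_rate(normal):.2f}")
```

A repetitive caption such as the `degenerate` string above scores close to 1.0, while a fluent, non-repetitive description scores near 0.0; complementary metrics at different n or at the sentence level can distinguish phrase-level from sentence-level loops.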