Frame sampling is a fundamental component of video understanding and video-language model pipelines, yet evaluating the quality of sampled frames remains challenging. Existing evaluation metrics focus primarily on perceptual quality or reconstruction fidelity and are not designed to assess whether a set of sampled frames adequately captures informative and representative video content. We propose Spatio-Temporal Entropy Coverage (STEC), a simple no-reference metric for evaluating the effectiveness of video frame sampling. STEC builds on Spatio-Temporal Frame Entropy (STFE), which measures per-frame spatial information via entropy-based structural complexity, and scores a set of sampled frames by its temporal coverage and redundancy. By jointly modeling spatial information strength, temporal dispersion, and non-redundancy, STEC provides a principled and lightweight measure of sampling quality. Experiments on the MSR-VTT test-1k benchmark demonstrate that STEC clearly differentiates common sampling strategies, including random, uniform, and content-aware methods. We further show that STEC reveals per-video robustness patterns that average performance alone does not capture, highlighting its practical value as a general-purpose evaluation tool for efficient video understanding. We emphasize that STEC is not designed to predict downstream task accuracy; rather, it provides a task-agnostic diagnostic signal for analyzing frame sampling behavior under constrained budgets.
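To make the three ingredients named above concrete (spatial information strength, temporal dispersion, and non-redundancy), the following is a minimal illustrative sketch in Python. It is not the STEC or STFE definition, which the abstract does not spell out. Every choice here is an assumption made for illustration: the gray-level histogram entropy as a stand-in for per-frame information, the segment-hit fraction for dispersion, histogram intersection for redundancy, the product used to combine them, and the hypothetical names `BINS`, `gray_hist`, `frame_entropy`, and `toy_coverage_score`.

```python
import numpy as np

BINS = 32  # hypothetical histogram resolution, chosen only for this sketch


def gray_hist(frame):
    """Normalized gray-level histogram of an 8-bit grayscale frame."""
    hist, _ = np.histogram(frame, bins=BINS, range=(0, 256))
    return hist / hist.sum()


def frame_entropy(frame):
    """Shannon entropy of the histogram, scaled to [0, 1] (assumed proxy)."""
    p = gray_hist(frame)
    p = p[p > 0]  # skip empty bins so log2 is defined
    return float(-(p * np.log2(p)).sum() / np.log2(BINS))


def toy_coverage_score(frames, sampled_idx):
    """Toy score in [0, 1]: information x dispersion x non-redundancy."""
    idx = sorted(set(sampled_idx))
    if not idx:
        raise ValueError("sampled_idx must contain at least one frame index")
    k, n = len(idx), len(frames)

    # Spatial information: mean normalized entropy of the sampled frames.
    info = float(np.mean([frame_entropy(frames[i]) for i in idx]))

    # Temporal dispersion: fraction of k equal-length segments that
    # contain at least one sampled frame.
    segments = {min(i * k // n, k - 1) for i in idx}
    dispersion = len(segments) / k

    # Non-redundancy: one minus the mean histogram intersection between
    # temporally adjacent sampled frames.
    if k > 1:
        overlap = [
            np.minimum(gray_hist(frames[a]), gray_hist(frames[b])).sum()
            for a, b in zip(idx, idx[1:])
        ]
        novelty = 1.0 - float(np.mean(overlap))
    else:
        novelty = 1.0

    return info * dispersion * novelty
```

Under these assumptions, a sample spread across visually distinct parts of a video scores higher than one clustered in a single near-static stretch, which is the qualitative behavior the abstract attributes to a coverage-style metric. The multiplicative combination is only one possible design: it drives the score to zero whenever any single factor collapses.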