Recent progress in multi-modal large language models (MLLMs) has significantly advanced video understanding. However, their performance on long-form videos remains limited by computational constraints and suboptimal frame selection. We present Think-Clip-Sample (TCS), a training-free framework that enhances long video understanding through two key components: (i) Multi-Query Reasoning, which generates multiple queries to capture complementary aspects of the question and video; and (ii) Clip-level Slow-Fast Sampling, which adaptively balances dense local details against sparse global context. Extensive experiments on MLVU, LongVideoBench, and VideoMME demonstrate that TCS consistently improves performance across different MLLMs, with accuracy gains of up to 6.9%, and achieves comparable accuracy at 50% lower inference-time cost, highlighting both the efficiency and efficacy of TCS for long video understanding.
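As a rough illustration of the clip-level slow-fast sampling idea, the sketch below splits a fixed frame budget between dense sampling inside query-relevant clips (local details) and sparse uniform sampling over the whole video (global context). The function name `slow_fast_sample`, the `dense_ratio` split, and the length-proportional per-clip allocation are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def slow_fast_sample(num_frames, clips, budget, dense_ratio=0.75):
    """Hypothetical sketch of clip-level slow-fast sampling.

    num_frames: total frames in the video.
    clips: list of (start, end) frame ranges judged relevant to the query.
    budget: total number of frames the MLLM can ingest.
    dense_ratio: assumed fraction of the budget spent inside relevant clips.
    """
    dense_budget = int(budget * dense_ratio)
    sparse_budget = budget - dense_budget

    # Dense ("fast") sampling: spread the dense budget across the
    # relevant clips, proportional to each clip's length.
    total_clip_len = sum(e - s for s, e in clips)
    dense_idx = []
    for s, e in clips:
        k = max(1, round(dense_budget * (e - s) / total_clip_len))
        dense_idx.extend(np.linspace(s, e - 1, k).astype(int))

    # Sparse ("slow") sampling: uniform stride over the full video
    # to preserve global context.
    sparse_idx = np.linspace(0, num_frames - 1, sparse_budget).astype(int)

    # Merge the two pathways, dropping duplicate frame indices.
    return sorted(set(dense_idx) | set(sparse_idx))

# Example: a 10,000-frame video with two relevant clips and a 64-frame budget.
frames = slow_fast_sample(10_000, [(1_200, 1_500), (7_800, 8_100)], budget=64)
```

In practice, the relevant clips would presumably come from the multi-query reasoning stage, e.g., by scoring frame-query similarity for each generated query and merging high-scoring segments.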