Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding

The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video tokens. Consequently, recent research has focused on query-aware frame selection, an approach that often incurs significant computational overhead. This paper challenges the assumption that such complex search mechanisms are universally necessary. We first identify and validate a query typology that distinguishes global queries from localized queries. We demonstrate that uniform sampling is both effective and efficient for global queries, whereas localized queries do require query-aware selection for optimal performance. Building on this insight, we propose DIG, a training-free frame selection framework that adapts its strategy to the query type. Specifically, DIG employs efficient uniform sampling for global queries and activates a specialized pipeline to extract query-relevant frames for localized queries. Experiments on three long-form video understanding benchmarks demonstrate that DIG consistently outperforms existing baselines and robustly improves LMM performance, even when the input frame count is scaled to 256.
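The routing idea in the abstract can be illustrated with a minimal sketch. It is not DIG's actual implementation: the abstract does not say how the query type is determined or what the localized pipeline consists of, so the `query_type` label, the per-frame relevance `scores`, and the top-k selector below are all hypothetical stand-ins.

```python
from typing import List, Sequence


def uniform_indices(num_frames: int, budget: int) -> List[int]:
    """Pick up to `budget` frame indices spread evenly over the video.

    The video is split into equal segments and the midpoint of each is taken.
    """
    if num_frames <= 0 or budget <= 0:
        return []
    budget = min(budget, num_frames)
    step = num_frames / budget
    return [int(step * i + step / 2) for i in range(budget)]


def top_k_indices(scores: Sequence[float], budget: int) -> List[int]:
    """Hypothetical stand-in for a query-aware selector.

    Keeps the `budget` highest-scoring frames and returns them in temporal order.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def select_frames(query_type: str, scores: Sequence[float], budget: int) -> List[int]:
    """Route by query type: uniform sampling for global, scored selection for localized.

    `query_type` is assumed to be supplied by some upstream classifier,
    and `scores` by some query-frame relevance model; both are assumptions.
    """
    if query_type == "global":
        return uniform_indices(len(scores), budget)
    if query_type == "localized":
        return top_k_indices(scores, budget)
    raise ValueError(f"unknown query type: {query_type!r}")


if __name__ == "__main__":
    relevance = [0.1, 0.9, 0.2, 0.8, 0.3, 0.0, 0.4, 0.6, 0.5, 0.7]
    print(select_frames("global", relevance, 5))
    print(select_frames("localized", relevance, 3))
```

The point of the split is that the global branch never touches the relevance scores, so the cost of computing them would only be paid for localized queries.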

