Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding

Long video understanding remains a formidable challenge for Multimodal Large Language Models (MLLMs) due to the prohibitive computational cost of processing dense frame sequences. Prevailing solutions, which select a keyframe subset, typically rely on either a single visual-centric metric (e.g., CLIP similarity) or a static fusion of heuristic scores. This ``one-size-fits-all'' paradigm frequently fails: visual-only metrics are ineffective for plot-driven narrative queries, while indiscriminately incorporating textual scores introduces severe ``modal noise'' for purely visual tasks. To break this bottleneck, we propose Q-Gate, a plug-and-play and training-free framework that treats keyframe selection as a dynamic modality routing problem. We decouple the retrieval process into three lightweight expert streams: Visual Grounding for local details, Global Matching for scene semantics, and Contextual Alignment for subtitle-driven narratives. Crucially, Q-Gate introduces a Query-Modulated Gating Mechanism that leverages the in-context reasoning of an LLM to assess the query's intent and dynamically allocate attention weights across the experts. This mechanism intelligently activates necessary modalities while ``muting'' irrelevant ones, thereby maximizing the signal-to-noise ratio. Extensive experiments on LongVideoBench and Video-MME across multiple MLLM backbones demonstrate that Q-Gate substantially outperforms state-of-the-art baselines. By effectively suppressing modality-specific noise, it provides a robust, highly interpretable solution for scalable video reasoning.
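
The routing idea described above can be illustrated with a minimal sketch. The abstract does not specify how the gate weights are combined with the expert scores, so this example assumes a simple weighted sum of min-max-normalized per-frame scores followed by top-k selection. The function names, the score values, and the hard-coded gate weights are all hypothetical; in Q-Gate the weights would come from an LLM's assessment of the query, which is not modeled here.

```python
from typing import Dict, List

# The three expert streams named in the abstract (identifiers are hypothetical).
EXPERTS = ("visual_grounding", "global_matching", "contextual_alignment")


def min_max(scores: List[float]) -> List[float]:
    """Rescale one expert's per-frame scores to [0, 1] so experts are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def normalize_gate(weights: Dict[str, float]) -> Dict[str, float]:
    """Clip negative weights to zero and rescale so the gate sums to 1."""
    clipped = {name: max(0.0, weights.get(name, 0.0)) for name in EXPERTS}
    total = sum(clipped.values())
    if total == 0.0:
        # Assumption: fall back to uniform weights if the gate mutes everything.
        return {name: 1.0 / len(EXPERTS) for name in EXPERTS}
    return {name: w / total for name, w in clipped.items()}


def select_keyframes(
    expert_scores: Dict[str, List[float]],
    gate_weights: Dict[str, float],
    k: int,
) -> List[int]:
    """Fuse expert scores with query-dependent weights and return k frame indices."""
    gate = normalize_gate(gate_weights)
    normed = {name: min_max(expert_scores[name]) for name in EXPERTS}
    n = len(normed[EXPERTS[0]])
    fused = [sum(gate[name] * normed[name][i] for name in EXPERTS) for i in range(n)]
    top = sorted(range(n), key=lambda i: fused[i], reverse=True)[:k]
    return sorted(top)  # return the chosen frames in temporal order


# Made-up scores for six candidate frames.
scores = {
    "visual_grounding": [0.9, 0.1, 0.2, 0.8, 0.1, 0.3],
    "global_matching": [0.5, 0.4, 0.6, 0.5, 0.4, 0.5],
    "contextual_alignment": [0.1, 0.9, 0.2, 0.1, 0.8, 0.2],
}

# Hypothetical gate outputs: a purely visual query mutes the subtitle expert,
# while a narrative query mutes the local visual expert.
visual_gate = {"visual_grounding": 0.7, "global_matching": 0.3, "contextual_alignment": 0.0}
narrative_gate = {"visual_grounding": 0.0, "global_matching": 0.2, "contextual_alignment": 0.8}

visual_frames = select_keyframes(scores, visual_gate, k=2)
narrative_frames = select_keyframes(scores, narrative_gate, k=2)
print(visual_frames, narrative_frames)
```

With the same candidate frames, the two gate settings pick different keyframes: setting an expert's weight to zero removes its scores from the fused ranking entirely, which is the "muting" behavior the abstract describes. A static fusion would instead use one fixed weight vector for every query.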
