Despite advancements in multimodal large language models (MLLMs), current approaches struggle with medium-to-long video understanding due to frame and context length limitations. As a result, these models often rely on frame sampling, which risks missing key information over time and lacks task-specific relevance. To address these challenges, we introduce HierarQ, a task-aware hierarchical Q-Former-based framework that processes frames sequentially, bypassing the need for frame sampling while avoiding the LLM's context-length limitations. We introduce a lightweight two-stream language-guided feature modulator to incorporate task awareness into video understanding: the entity stream captures frame-level object information within a short context, while the scene stream identifies broader interactions among those objects over a longer period of time. Each stream is supported by a dedicated memory bank, which enables our proposed Hierarchical Querying transformer (HierarQ) to effectively capture both short- and long-term context. Extensive evaluations on 10 video benchmarks across video understanding, question answering, and captioning tasks demonstrate HierarQ's state-of-the-art performance on most datasets, underscoring its robustness and efficiency for comprehensive video analysis.
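To make the two-stream design concrete, the following minimal PyTorch sketch illustrates one plausible reading of the architecture described above. Every name and hyperparameter here (`MemoryBank`, `LanguageGuidedModulator`, `StreamQFormer`, the memory capacities 4 and 64, the query counts) is a hypothetical illustration, not the authors' implementation: the entity stream keeps a short memory, the scene stream a long one, and both cross-attend learnable queries over language-modulated frame features processed one frame at a time.

```python
# Minimal sketch of a two-stream, language-guided modulator with per-stream
# memory banks, in the spirit of the HierarQ abstract. All module names,
# dimensions, and capacities are illustrative assumptions.
import torch
import torch.nn as nn


class MemoryBank:
    """FIFO memory of past frame features; `capacity` bounds the context."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.slots: list[torch.Tensor] = []

    def write(self, feat: torch.Tensor) -> None:
        self.slots.append(feat)
        if len(self.slots) > self.capacity:
            self.slots.pop(0)  # drop the oldest entry

    def read(self) -> torch.Tensor:
        # (T, N, D): stack of retained per-frame features
        return torch.stack(self.slots, dim=0)


class LanguageGuidedModulator(nn.Module):
    """Scales frame tokens by their relevance to the task prompt."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, frame_feat: torch.Tensor, text_feat: torch.Tensor):
        # frame_feat: (N, D) visual tokens; text_feat: (D,) pooled prompt
        relevance = torch.sigmoid(self.proj(frame_feat) @ text_feat)  # (N,)
        return frame_feat * relevance.unsqueeze(-1)


class StreamQFormer(nn.Module):
    """One querying-transformer stream: learnable queries cross-attend to
    the current frame's features together with the stream's memory."""

    def __init__(self, dim: int, n_queries: int, mem_capacity: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.memory = MemoryBank(mem_capacity)

    def forward(self, frame_feat: torch.Tensor) -> torch.Tensor:
        # detach() is a simplification to avoid backprop through all history
        self.memory.write(frame_feat.detach())
        context = self.memory.read().flatten(0, 1)            # (T*N, D)
        q = self.queries.unsqueeze(0)                         # (1, Q, D)
        out, _ = self.attn(q, context.unsqueeze(0), context.unsqueeze(0))
        return out.squeeze(0)                                 # (Q, D)


class HierarQSketch(nn.Module):
    """Entity stream = short memory; scene stream = long memory. Both
    outputs would be fused and fed to a frozen LLM (omitted here)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.modulator = LanguageGuidedModulator(dim)
        self.entity_stream = StreamQFormer(dim, n_queries=32, mem_capacity=4)
        self.scene_stream = StreamQFormer(dim, n_queries=32, mem_capacity=64)

    def forward(self, frame_feats, text_feat):
        # frame_feats: iterable of (N, D) per-frame tokens, in temporal order
        for feat in frame_feats:  # sequential processing: no frame sampling
            mod = self.modulator(feat, text_feat)
            ent = self.entity_stream(mod)   # short-term, object-level
            scn = self.scene_stream(mod)    # long-term, scene-level
        return torch.cat([ent, scn], dim=0)  # (2Q, D) tokens for the LLM
```

The design point the sketch mirrors is that each stream's memory capacity, rather than the LLM's context window, bounds how much history is attended to, so arbitrarily long videos can be streamed frame by frame at a fixed per-step cost.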