Large language models (LLMs) have demonstrated exceptional capabilities in text understanding, which has paved the way for their expansion into video LLMs (Vid-LLMs) for analyzing video data. However, current Vid-LLMs struggle to simultaneously retain high-quality frame-level semantic information (i.e., a sufficient number of tokens per frame) and comprehensive video-level temporal information (i.e., an adequate number of sampled frames per video). This limitation hinders the advancement of Vid-LLMs toward fine-grained video understanding. To address this issue, we introduce the SlowFocus mechanism, which significantly increases the equivalent sampling frequency without compromising the quality of frame-level visual tokens. SlowFocus first identifies the query-related temporal segment based on the posed question, then performs dense sampling on this segment to extract local high-frequency features. A multi-frequency mixing attention module is further leveraged to aggregate these local high-frequency details with global low-frequency contexts for enhanced temporal comprehension. Additionally, to tailor Vid-LLMs to this mechanism, we introduce a set of training strategies aimed at strengthening both temporal grounding and detailed temporal reasoning. Furthermore, we establish FineAction-CGR, a benchmark specifically devised to assess the ability of Vid-LLMs to handle fine-grained temporal understanding tasks. Comprehensive experiments demonstrate the superiority of our mechanism on both existing public video understanding benchmarks and our proposed FineAction-CGR.
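To make the sampling idea concrete, below is a minimal, hypothetical sketch of the SlowFocus-style pipeline. All function names, feature dimensions, and the toy cross-attention are our own illustrative assumptions, not the paper's actual implementation: sparse global (low-frequency) sampling over the whole video, dense local (high-frequency) sampling inside a query-grounded segment, and one attention step mixing the two.

```python
import numpy as np

def sample_indices(n_frames, start, end, n_global=8, n_local=16):
    """Return sparse global frame indices over the full video and
    dense local indices inside the query-grounded segment [start, end].
    (Hypothetical helper; the paper's actual sampler may differ.)"""
    global_idx = np.linspace(0, n_frames - 1, n_global).round().astype(int)
    local_idx = np.linspace(start, end, n_local).round().astype(int)
    return global_idx, local_idx

def mix_attention(local_feats, global_feats):
    """Toy stand-in for multi-frequency mixing attention:
    local high-frequency tokens attend to global low-frequency tokens,
    then the attended context is added back residually."""
    d = local_feats.shape[-1]
    scores = local_feats @ global_feats.T / np.sqrt(d)   # (L, G) similarity
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over global tokens
    return local_feats + weights @ global_feats          # residual mix

# Usage: a 300-frame video whose question is grounded to frames 120-150.
rng = np.random.default_rng(0)
feats = rng.standard_normal((300, 64))                   # stand-in frame features
g_idx, l_idx = sample_indices(300, 120, 150)
mixed = mix_attention(feats[l_idx], feats[g_idx])
print(mixed.shape)                                       # (16, 64)
```

The key point the sketch illustrates is the asymmetry: the segment of interest receives as many samples as the entire rest of the video combined, raising the effective sampling frequency where the question demands it without inflating the global token budget.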