Large video-language models (LVLMs) have shown remarkable performance across various video-language tasks. However, they encounter significant challenges when processing long videos because of the large number of video frames involved. Downsampling long videos in either space or time can cause visual hallucinations, making accurate interpretation difficult. Motivated by human hierarchical temporal search strategies, we propose \textbf{TimeSearch}, a novel framework that enables LVLMs to understand long videos in a human-like manner. TimeSearch integrates two human-like primitives into a unified autoregressive LVLM: 1) \textbf{Spotlight} efficiently identifies relevant temporal events through a Temporal-Augmented Frame Representation (TAFR), which explicitly binds visual features to timestamps; 2) \textbf{Reflection} evaluates the correctness of the identified events, leveraging the inherent temporal self-reflection capabilities of LVLMs. TimeSearch progressively explores key events and prioritizes the temporal search according to reflection confidence. Extensive experiments on challenging long-video benchmarks confirm that TimeSearch substantially surpasses the previous state of the art, improving accuracy from 41.8\% to 51.5\% on LVBench. Additionally, experiments on temporal grounding demonstrate that an appropriate TAFR is sufficient to elicit the latent temporal grounding ability of LVLMs in a simple yet versatile manner, improving mIoU on Charades-STA by 11.8\%. The code will be released.
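The confidence-prioritized search described above can be sketched as a best-first search over temporal segments. This is a minimal illustration, not the paper's implementation: the `spotlight` and `reflect` functions are hypothetical stubs standing in for LVLM calls, and the segment-halving and confidence heuristics are assumptions made purely so the sketch runs end to end.

```python
import heapq

def timestamp_tagged_frames(frames, fps=1.0):
    """TAFR sketch: pair each frame feature with its timestamp so that
    answers can be grounded in time. (Illustrative only.)"""
    return [(i / fps, f) for i, f in enumerate(frames)]

def spotlight(segment, question):
    """Stub: a real system would prompt the LVLM to pick sub-segments
    relevant to the question. Here we simply halve the segment."""
    lo, hi = segment
    mid = (lo + hi) / 2
    return [(lo, mid), (mid, hi)]

def reflect(segment, question):
    """Stub: a real system would ask the LVLM to score its confidence
    that `segment` contains the answer. Here, shorter segments (more
    focused evidence) get higher confidence."""
    lo, hi = segment
    return 1.0 / (1.0 + (hi - lo))

def time_search(duration, question, min_len=4.0, max_steps=32):
    """Best-first temporal search: repeatedly expand the segment the
    model is most confident about, until one is short enough to
    answer from directly."""
    heap = [(-reflect((0.0, duration), question), (0.0, duration))]
    for _ in range(max_steps):
        _, seg = heapq.heappop(heap)
        if seg[1] - seg[0] <= min_len:
            return seg  # localized key event
        for child in spotlight(seg, question):
            heapq.heappush(heap, (-reflect(child, question), child))
    return seg  # budget exhausted; return the last candidate
```

With real LVLM-backed `spotlight` and `reflect` calls, the same loop would zoom in on the event most likely to answer the query while skipping irrelevant spans of the video.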