This work proposes TimeChat, a time-sensitive multimodal large language model specifically designed for long video understanding. Our model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds visual content with the timestamp of each frame, and (2) a sliding video Q-Former that produces a video token sequence of varying lengths to accommodate videos of various durations. Additionally, we construct an instruction-tuning dataset, encompassing 6 tasks and a total of 125K instances, to further enhance TimeChat's instruction-following performance. Experimental results across various video understanding tasks, such as dense captioning, temporal grounding, and highlight detection, demonstrate TimeChat's strong zero-shot temporal localization and reasoning capabilities. For example, it achieves +9.2 F1 score and +2.8 CIDEr on YouCook2, +5.8 HIT@1 on QVHighlights, and +27.5 R@1 (IoU=0.5) on Charades-STA, compared to state-of-the-art video large language models. These results suggest that TimeChat can serve as a versatile video assistant for long-form video comprehension tasks and satisfy realistic user requirements.