Despite constituting 65% of all internet traffic in 2023, video content is underrepresented in generative AI research. Meanwhile, recent large language models (LLMs) have become increasingly integrated with capabilities in the visual modality. Integrating video with LLMs is a natural next step, so how can this gap be bridged? To advance video reasoning, we propose a new research direction, VideoCOT on video keyframes, which leverages the multimodal generative abilities of vision-language models to enhance video reasoning while reducing the computational complexity of processing hundreds or thousands of frames. We introduce VIP, an inference-time dataset for evaluating VideoCOT, containing 1) a variety of real-life videos with keyframes and corresponding unstructured and structured scene descriptions, and 2) two new video reasoning tasks: video infilling and scene prediction. We benchmark various vision-language models on VIP, demonstrating the potential of vision-language models and LLMs to enhance video chain-of-thought reasoning.