Large Language Models (LLMs) have enabled recent approaches to achieve excellent performance on long-video understanding benchmarks. We investigate how the extensive world knowledge and strong reasoning skills of the underlying LLMs influence this strong performance. Surprisingly, we discover that LLM-based approaches can yield good accuracy on long-video tasks with limited video information, sometimes even with no video-specific information at all. Building on this finding, we explore injecting video-specific information into an LLM-based framework. We use off-the-shelf vision tools to extract three object-centric information modalities from videos, and then leverage natural language as a medium for fusing this information. Our resulting Multimodal Video Understanding (MVU) framework achieves state-of-the-art performance across multiple video understanding benchmarks. Strong performance on robotics-domain tasks further establishes its generality. Our code will be released publicly.