In this study, we initiate an exploration into video understanding by introducing VideoChat, an end-to-end chat-centric video understanding system. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we propose a video-centric instruction dataset composed of thousands of videos paired with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and causal relationships, providing a valuable asset for training chat-centric video understanding systems. Preliminary qualitative experiments reveal our system's potential across a broad spectrum of video applications and set the standard for future research. Our code and data are available at https://github.com/OpenGVLab/Ask-Anything