Existing deep video models are tied to specific tasks, operate on fixed input-output spaces, and generalize poorly, which makes them difficult to deploy in real-world scenarios. In this paper, we present our vision for multimodal and versatile video understanding and propose a prototype system, \system. Our system is built upon a tracklet-centric paradigm, which treats tracklets as the basic video unit and employs various Video Foundation Models (ViFMs) to annotate their properties, such as appearance and motion. All detected tracklets are stored in a database, and a database manager handles interaction with the user. We have conducted extensive case studies on different types of in-the-wild videos, which demonstrate the effectiveness of our method in answering various video-related questions. Our project is available at https://www.wangjunke.info/ChatVideo/
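To make the tracklet-centric idea concrete, the following is a minimal sketch of how annotated tracklets might be stored and queried. It is an illustration under stated assumptions, not the implementation of \system: the table schema, the column names, and the helper functions `build_tracklet_db` and `find_tracklets` are hypothetical, and the appearance and motion strings stand in for annotations that ViFMs would produce.

```python
import sqlite3


def build_tracklet_db(tracklets):
    """Store tracklet records in an in-memory SQLite database.

    Each record is (id, category, start_frame, end_frame, appearance, motion).
    This schema is an assumption made for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tracklets ("
        "id INTEGER PRIMARY KEY, category TEXT, start_frame INTEGER, "
        "end_frame INTEGER, appearance TEXT, motion TEXT)"
    )
    conn.executemany(
        "INSERT INTO tracklets VALUES (?, ?, ?, ?, ?, ?)", tracklets
    )
    conn.commit()
    return conn


def find_tracklets(conn, category):
    """Return all tracklets of one category, ordered by first appearance."""
    return conn.execute(
        "SELECT id, start_frame, end_frame, appearance, motion "
        "FROM tracklets WHERE category = ? ORDER BY start_frame",
        (category,),
    ).fetchall()


# Hypothetical annotations; in a real system these would come from ViFMs.
example = [
    (1, "person", 0, 120, "red jacket", "walking left to right"),
    (2, "dog", 30, 90, "brown fur", "running"),
    (3, "person", 200, 260, "blue shirt", "standing still"),
]
conn = build_tracklet_db(example)
dogs = find_tracklets(conn, "dog")
```

In such a design, a user question like "what is the dog doing?" would be translated by the database manager into a lookup of this kind, and the returned motion annotation would form the basis of the answer.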