Existing methods for long video understanding primarily focus on videos lasting only tens of seconds, with limited exploration of techniques for handling longer videos. The increased number of frames in longer videos presents two main challenges: difficulty in locating key information and in performing long-range reasoning. We therefore propose DrVideo, a document-retrieval-based system designed for long video understanding. Our key idea is to convert the long-video understanding problem into a long-document understanding task so as to effectively leverage the power of large language models. Specifically, DrVideo first transforms a long video into a text-based long document, then retrieves an initial set of key frames and augments their information; the result serves as the system's starting point. It then employs an agent-based iterative loop to continuously search for missing information, augment relevant data, and provide final predictions in a chain-of-thought manner once sufficient question-related information has been gathered. Extensive experiments on long video benchmarks confirm the effectiveness of our method. DrVideo outperforms existing state-of-the-art methods by +3.8 accuracy on the EgoSchema benchmark (3 minutes), +17.9 in MovieChat-1K break mode, +38.0 in MovieChat-1K global mode (10 minutes), and +30.2 on the LLama-Vid QA dataset (over 60 minutes).
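The retrieve, augment, and iterate flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `FrameDoc`, `video_to_document`, `retrieve`, and `answer_question` are hypothetical, word-overlap scoring stands in for whatever retriever the system actually uses, and the `augment` and `agent` callables are placeholders for the captioning model and the LLM agent, whose interfaces the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class FrameDoc:
    """One entry of the text document built from the video (hypothetical type)."""
    index: int
    caption: str
    extra: str = ""  # question-specific detail added by augmentation


# Hypothetical interfaces: the abstract does not define these signatures.
Augment = Callable[[int, str], str]
Agent = Callable[[List[FrameDoc], str], Tuple[Optional[str], List[int]]]


def video_to_document(captions: List[str]) -> List[FrameDoc]:
    """Turn per-frame captions into a text document, one entry per frame."""
    return [FrameDoc(i, c) for i, c in enumerate(captions)]


def retrieve(document: List[FrameDoc], question: str, top_k: int) -> List[int]:
    """Rank frames by word overlap with the question (assumed stand-in retriever)."""
    q = set(question.lower().split())
    ranked = sorted(
        document,
        key=lambda f: (-len(q & set(f.caption.lower().split())), f.index),
    )
    return [f.index for f in ranked[:top_k]]


def answer_question(
    captions: List[str],
    question: str,
    augment: Augment,
    agent: Agent,
    top_k: int = 2,
    max_steps: int = 5,
) -> Optional[str]:
    """Retrieve and augment initial key frames, then loop until the agent answers."""
    document = video_to_document(captions)

    def augment_frames(indices: List[int]) -> None:
        # Augment each requested frame at most once.
        for i in indices:
            if not document[i].extra:
                document[i].extra = augment(i, question)

    # Starting point: initial key frames with augmented information.
    augment_frames(retrieve(document, question, top_k))

    for _ in range(max_steps):
        # The agent returns an answer, or None plus the frames it still needs.
        answer, missing = agent(document, question)
        if answer is not None:
            return answer
        augment_frames(missing)
    return None  # step budget exhausted without sufficient information
```

The `max_steps` bound is an added assumption so that the loop always terminates; the abstract states only that iteration continues until sufficient question-related information is gathered.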