Recent advances in video backbone architectures, together with the success of large language models (LLMs), have made the analysis of long-form videos spanning tens of minutes both feasible and increasingly common. However, the inherent redundancy of video sequences poses two main challenges for current state-of-the-art models: 1) fitting a larger number of frames within memory constraints, and 2) extracting discriminative information from the large volume of input data. In this paper, we introduce a novel end-to-end framework for long-form video understanding that integrates an information-density-based adaptive video sampler (AVS) and an autoencoder-based spatiotemporal video compressor (SVC) with a multimodal large language model (MLLM). The system offers two main advantages: it adaptively captures essential information from video sequences of varying durations, and it achieves high compression rates while preserving crucial discriminative information. The framework demonstrates promising performance on both long-form video understanding tasks and standard video understanding benchmarks, which underscores its versatility and efficacy, particularly for prolonged video sequences.
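To make the idea of information-density-based sampling concrete, the following is a minimal sketch and not the AVS described above, whose internals the abstract does not specify. It assumes that information density can be approximated by the mean absolute difference between consecutive frame feature vectors, and it selects frames at evenly spaced quantiles of the cumulative density. The function names `frame_change_scores` and `adaptive_sample`, the `eps` parameter, and the quantile rule are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def frame_change_scores(frames: Sequence[Sequence[float]]) -> List[float]:
    """Mean absolute difference between each frame and its predecessor.

    Assumption: this difference is a stand-in for "information density".
    The first frame has no predecessor, so its score is 0.0.
    """
    scores = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / max(len(cur), 1)
        scores.append(diff)
    return scores


def adaptive_sample(
    frames: Sequence[Sequence[float]], budget: int, eps: float = 1e-6
) -> List[int]:
    """Return at most `budget` frame indices, denser where frames change more.

    Hypothetical rule: pick the frame at which the cumulative score crosses
    each of `budget` evenly spaced quantiles. `eps` keeps fully static
    videos from producing a zero total.
    """
    n = len(frames)
    if budget <= 0 or n == 0:
        return []
    if budget >= n:
        return list(range(n))

    weights = [s + eps for s in frame_change_scores(frames)]
    total = sum(weights)

    picked: List[int] = []
    cumulative = 0.0
    k = 0  # index of the next quantile target, (k + 0.5) / budget
    for i, w in enumerate(weights):
        cumulative += w
        while k < budget and cumulative >= total * (k + 0.5) / budget:
            # Several targets can fall on one frame; keep it only once.
            if not picked or picked[-1] != i:
                picked.append(i)
            k += 1
    return picked


if __name__ == "__main__":
    # Ten static frames followed by ten changing frames (1-D toy features).
    toy = [[0.0]] * 10 + [[float(i)] for i in range(1, 11)]
    print(adaptive_sample(toy, budget=5))
```

On the toy input, the selected indices all fall in the changing second half, since the static first half contributes almost nothing to the cumulative score. A real sampler would operate on learned features rather than raw values, and the compression stage (SVC) is not modeled here.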