While multimodal large language models (MLLMs) have shown remarkable success across a wide range of tasks, long-form video understanding remains a significant challenge: processing a full stream of RGB frames is computationally intractable and highly redundant, since self-attention has quadratic complexity in sequence length. In this paper, we propose ReMoRa, a video MLLM that operates directly on the compressed representation of a video. A sparse set of RGB keyframes is retained for appearance, while temporal dynamics are encoded as motion representations, removing the need for dense sequential RGB frames. These motion representations act as a compact proxy for optical flow, capturing temporal dynamics without full frame decoding. To address the noise and low fidelity of block-based motion vectors, we introduce a module that denoises them and produces a fine-grained motion representation. Furthermore, our model compresses these features so that cost scales linearly with sequence length. We demonstrate the effectiveness of ReMoRa through extensive experiments on a comprehensive suite of long-video understanding benchmarks, where it outperforms baseline methods on multiple challenging benchmarks, including LongVideoBench, NExT-QA, and MLVU.