PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

The past year has witnessed significant advances in video-based large language models (video LLMs). However, developing a unified model for both short and long video understanding remains an unresolved challenge. Most existing video LLMs cannot handle hour-long videos, while methods tailored to long videos tend to be ineffective for shorter videos and images. In this paper, we identify the key issue as the redundant content in videos. To address this, we propose a novel pooling strategy that simultaneously achieves token compression and instruction-aware visual feature aggregation. Our model is termed Prompt-guided Pooling LLaVA, or PPLLaVA for short. Specifically, PPLLaVA consists of three core components: CLIP-based visual-prompt alignment, which extracts visual information relevant to the user's instructions; prompt-guided pooling, which compresses the visual sequence to arbitrary scales using convolution-style pooling; and CLIP context extension, designed for the lengthy prompts common in visual dialogue. Moreover, our codebase also integrates the most advanced video Direct Preference Optimization (DPO) and visual interleave training. Extensive experiments have validated the performance of our model. With superior throughput and a visual context of only 1024 tokens, PPLLaVA achieves better results on image benchmarks as a video LLM, while achieving state-of-the-art performance across various video benchmarks, excelling in tasks ranging from caption generation to multiple-choice questions, and handling video lengths from seconds to hours. Code is available at https://github.com/farewellthree/PPLLaVA.
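
To make the idea of prompt-guided pooling concrete, the following is a minimal one-dimensional sketch, not the PPLLaVA implementation. It assumes that visual tokens and the prompt are already embedded in a shared space (as CLIP-style alignment would provide), scores each token by cosine similarity to the prompt, and replaces uniform average pooling over each window with a softmax-weighted average. The function names, the per-window softmax, and the `temperature` parameter are all illustrative assumptions; the actual model pools over spatiotemporal windows.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prompt_guided_pool(
    tokens: List[Vector],
    prompt: Vector,
    window: int,
    temperature: float = 0.1,  # hypothetical sharpening factor, not from the paper
) -> List[Vector]:
    """Compress `tokens` by a factor of `window`, weighting by prompt relevance.

    Each non-overlapping window of tokens is reduced to one token: a weighted
    average whose weights are a softmax over token-prompt cosine similarities.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    pooled: List[Vector] = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        scores = [cosine(t, prompt) / temperature for t in chunk]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(chunk[0])
        pooled.append(
            [sum(w * t[d] for w, t in zip(weights, chunk)) for d in range(dim)]
        )
    return pooled


if __name__ == "__main__":
    # Four toy 2-D tokens compressed to two; the prompt favours the first axis.
    toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
    print(prompt_guided_pool(toy_tokens, prompt=[1.0, 0.0], window=2))
```

When the prompt is equally similar to every token in a window, the weights become uniform and the sketch reduces to ordinary average pooling; the window size plays the role of the compression scale.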
