Vision-language pre-training has significantly elevated performance across a wide range of image-language applications. Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources, which hinders the progress of video-language models. This paper investigates a straightforward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for dense video understanding. Our preliminary experiments reveal that directly fine-tuning pre-trained image-language models with multiple frames as inputs on video datasets leads to performance saturation or even degradation. Further investigation shows that this is largely attributable to the bias of learned high-norm visual features. Motivated by this finding, we propose a simple but effective pooling strategy that smooths the feature distribution along the temporal dimension and thus reduces the dominant impact of the extreme features. The new model is termed Pooling LLaVA, or PLLaVA in short. PLLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answering and captioning tasks. Notably, on the recent popular VideoChatGPT benchmark, PLLaVA achieves a score of 3.48 out of 5, averaged over five evaluated dimensions, exceeding the previous SOTA results from GPT4V (IG-VLM) by 9%. On the latest multi-choice benchmark MVBench, PLLaVA achieves 58.1% accuracy on average across 20 sub-tasks, 14.5% higher than GPT4V (IG-VLM). Code is available at https://pllava.github.io/
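To make the idea concrete, the following is a minimal sketch of pooling frame features along the temporal dimension, as the abstract describes. The shapes, function name, and grouping scheme here are illustrative assumptions for exposition, not the paper's actual implementation:

```python
import numpy as np

def temporal_pool(features, out_t):
    """Average-pool per-frame visual features along the temporal axis.

    features: array of shape (T, N, D) - T frames, N visual tokens,
              D feature dims (shapes are a hypothetical example).
    out_t:    number of temporal steps after pooling.

    Averaging neighboring frames smooths the temporal feature
    distribution, damping the influence of extreme high-norm features.
    """
    T = features.shape[0]
    # Split the T frame indices into out_t roughly equal groups,
    # then average the features within each group.
    groups = np.array_split(np.arange(T), out_t)
    return np.stack([features[g].mean(axis=0) for g in groups])

# Example: 16 frames, 576 patch tokens (24x24), 1024-dim features.
feats = np.random.rand(16, 576, 1024)
pooled = temporal_pool(feats, 4)
print(pooled.shape)  # (4, 576, 1024)
```

A practical implementation would typically apply such pooling inside the model (e.g. via an adaptive pooling layer over the stacked frame features) before passing the tokens to the language model.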