Recent Video-Language Models (VLMs) achieve promising results on long-video understanding, but their performance still lags behind that achieved on tasks involving images or short videos. This has led to great interest in improving the long-context modeling of VLMs by introducing novel modules and additional complexity. In this paper, we take a different approach: rather than fine-tuning VLMs with the limited data available, we attempt to maximize the performance of existing models. To this end, we propose a novel visual prompting strategy specifically designed for long-video understanding. By combining multiple frames as panels into one image, we effectively trade spatial detail for temporal resolution. Our approach is training-free, parameter-free, and model-agnostic, and can be seamlessly integrated into existing VLMs. Extensive experiments on five established benchmarks across a wide range of model architectures, sizes, and context windows confirm the consistency of our approach. On TimeScope (Long), the dataset with the longest videos, video question answering accuracy improves by up to 19.4%. Overall, our method raises the bar for long-video understanding models. The code is available at https://fedespu.github.io/Video-Panels.
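The panel idea described above can be illustrated with a minimal sketch. This is not the released implementation: the function name `frames_to_panels`, the grid parameters `rows` and `cols`, the stride-based downsampling, and the black padding of the last image are all assumptions made for illustration. It only shows how consecutive frames might be tiled into one image so that a fixed image budget covers more timestamps at lower per-frame resolution.

```python
import numpy as np


def frames_to_panels(frames, rows=2, cols=2, downsample=True):
    """Tile consecutive frames into composite images of rows x cols panels.

    frames: array of shape (T, H, W, C).
    Returns an array of shape (N, rows * h, cols * w, C), where
    N = ceil(T / (rows * cols)) and (h, w) is the per-panel size.
    Hypothetical helper; the paper's actual layout may differ.
    """
    frames = np.asarray(frames)
    if downsample:
        # Assumption: naive stride subsampling, so that each composite image
        # stays close to the size of one original frame.
        frames = frames[:, ::rows, ::cols]
    t, h, w, c = frames.shape
    per_image = rows * cols
    n_images = -(-t // per_image)  # ceiling division
    pad = n_images * per_image - t
    if pad:
        # Assumption: fill the unused panels of the last image with black.
        blank = np.zeros((pad, h, w, c), dtype=frames.dtype)
        frames = np.concatenate([frames, blank], axis=0)
    # (N, rows, cols, h, w, C) -> (N, rows, h, cols, w, C) -> (N, rows*h, cols*w, C)
    grid = frames.reshape(n_images, rows, cols, h, w, c)
    grid = grid.transpose(0, 1, 3, 2, 4, 5)
    return grid.reshape(n_images, rows * h, cols * w, c)


if __name__ == "__main__":
    video = np.random.randint(0, 256, size=(32, 224, 224, 3), dtype=np.uint8)
    panels = frames_to_panels(video, rows=2, cols=2)
    print(panels.shape)
```

Under these assumptions, a model limited to N input images sees rows × cols × N frames, in reading order within each composite image, at the cost of a proportionally smaller view of each frame.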