We present LongVPO, a novel two-stage Direct Preference Optimization (DPO) framework that enables short-context vision-language models to robustly understand ultra-long videos without any long-video annotations. In Stage 1, we synthesize preference triples by anchoring questions to individual short clips, interleaving those clips with distractors, and applying visual-similarity and question-specificity filters to mitigate positional bias and ensure unambiguous supervision. We also approximate the reference model's scoring over long contexts by evaluating only the anchor clip, which reduces computational overhead. In Stage 2, we run a recursive captioning pipeline on long videos to generate scene-level metadata, then use a large language model to craft multi-segment reasoning queries and dispreferred responses, aligning the model's preferences on tasks that require reasoning across segments. With only 16K synthetic examples and no costly human labels, LongVPO outperforms state-of-the-art open-source models on multiple long-video benchmarks while maintaining strong short-video performance (e.g., on MVBench), offering a scalable paradigm for efficient long-form video understanding.
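The Stage-1 construction and the anchor-only reference scoring can be illustrated with a minimal sketch. This is not the LongVPO implementation: the function names, the 7-distractor count, and the placeholder chosen/rejected answers are all illustrative assumptions; only the interleaving idea, the random anchor position (to mitigate positional bias), and the standard DPO objective come from the abstract.

```python
import math
import random

def build_preference_triple(anchor_clip, anchor_question, distractor_pool,
                            num_distractors=7, seed=0):
    # Stage-1 sketch: place the anchor clip at a random position among
    # distractor clips, so the supervised answer cannot exploit a fixed
    # position in the long context (mitigating positional bias).
    rng = random.Random(seed)
    distractors = rng.sample(distractor_pool, num_distractors)
    position = rng.randrange(num_distractors + 1)
    long_context = distractors[:position] + [anchor_clip] + distractors[position:]
    return {
        "context": long_context,                        # interleaved clip sequence
        "question": anchor_question,                    # specific to the anchor clip
        "chosen": f"answer grounded in {anchor_clip}",  # preferred response (placeholder)
        "rejected": "answer grounded in a distractor",  # dispreferred response (placeholder)
    }

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Standard DPO objective: -log sigmoid(beta * reward margin).
    # Per the abstract, the reference log-probs would be computed on the
    # anchor clip alone rather than the full interleaved long context.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

triple = build_preference_triple(
    "clip_anchor", "What color is the car in this clip?",
    [f"clip_{i}" for i in range(20)])
print(len(triple["context"]))  # 8: the anchor plus 7 distractors
print(dpo_loss(-1.0, -5.0, -2.0, -4.0))
```

Note that the loss shrinks as the policy's margin for the chosen response grows relative to the reference, which is what drives the model to prefer answers grounded in the anchor clip over distractor-based ones.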