VideoPro: Adaptive Program Reasoning for Long Video Understanding

Large language models (LLMs) have shown promise in generating program workflows for visual tasks. However, previous approaches often rely on closed-source models, lack systematic reasoning, and struggle with long-form video question answering (videoQA). To address these challenges, we introduce the FS-VisPR framework, an adaptive visual program reasoning approach that balances fast reasoning for simple queries with slow reasoning for difficult ones. First, we design efficient visual modules (e.g., key clip retrieval and subtitle retrieval) to support long-form video tasks. Then, we use a strong LLM to construct a diverse, high-quality fast-slow reasoning dataset and use it to align an open-source language model for generating visual program workflows, yielding FS-LLM. Next, we design a fast-slow reasoning framework around FS-LLM, motivated by human-like reasoning processes: simple queries are solved directly by VideoLLMs, while difficult ones invoke visual program reasoning. In this framework, a low-confidence fast-thinking answer triggers a second-stage slow-reasoning process, and the system falls back to fast reasoning if program execution fails. Moreover, we improve visual programs through parameter search during both training and inference. Adjusting the parameters of the visual modules within a program generates multiple variants: during training, the programs that yield correct answers are selected, while during inference, the program whose result has the highest confidence is applied. Experiments show that FS-VisPR improves both efficiency and reliability in visual program workflows. It achieves 50.4% accuracy on LVBench, surpassing GPT-4o, and matches the performance of Qwen2.5VL-72B on VideoMME.
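
The fast-slow routing and the inference-time parameter search described in the abstract can be illustrated with a minimal sketch. It is not the FS-VisPR implementation: the names `Answer`, `search_program_variants`, and `fast_slow_answer`, the callables standing in for the VideoLLM and the generated visual program, the parameter grid, and the confidence threshold of 0.7 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class Answer:
    text: str
    confidence: float  # assumed to lie in [0, 1]


def search_program_variants(
    run_program: Callable[[Dict[str, Any]], Answer],
    param_grid: Dict[str, List[Any]],
) -> Optional[Answer]:
    """Run one program variant per parameter combination and keep the
    most confident result. Returns None if every variant fails."""
    best: Optional[Answer] = None
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        try:
            answer = run_program(params)
        except Exception:
            continue  # a variant whose execution fails is skipped
        if best is None or answer.confidence > best.confidence:
            best = answer
    return best


def fast_slow_answer(
    fast_model: Callable[[], Answer],
    run_program: Callable[[Dict[str, Any]], Answer],
    param_grid: Dict[str, List[Any]],
    threshold: float = 0.7,  # hypothetical value, not taken from the paper
) -> Tuple[Answer, str]:
    """Return the chosen answer and the route that produced it."""
    fast = fast_model()
    if fast.confidence >= threshold:
        return fast, "fast"  # confident fast answer: stop here
    slow = search_program_variants(run_program, param_grid)
    if slow is None:
        return fast, "fallback"  # all program variants failed
    return slow, "slow"
```

Here `fast_model` would wrap a direct VideoLLM call and `run_program` would execute the generated visual program with, for example, a hypothetical `top_k` setting for key clip retrieval. The training-time variant of the search would differ only in its selection rule, keeping variants whose answers match the ground truth rather than the most confident one.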
