Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and fine-tuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach that improves video reasoning capability with substantially greater data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Building on observations about data scaling, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to utilize computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by 2.4% in accuracy while using only 3.6% of the training samples. In particular, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark. Notably, our pure-RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS's strong reasoning performance.
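To make the sparse-to-dense TTS idea concrete, the following is a minimal sketch of the inference loop as described above: sample a sparse set of frames, draw several candidate answers, and densify the frame sampling until the answers are sufficiently consistent. All names here (`sample_frames`, `model.generate`, the thresholds, and the doubling schedule) are hypothetical stand-ins, not the authors' actual implementation.

```python
from collections import Counter

def sparse_to_dense_tts(model, video, question,
                        init_frames=8, max_frames=64,
                        num_samples=5, agree_ratio=0.8):
    """Iteratively add frames until the model's sampled answers agree.

    Assumed interfaces (hypothetical):
      sample_frames(video, n) -> list of n uniformly sampled frames
      model.generate(frames, question) -> one sampled answer string
    """
    n_frames = init_frames
    while True:
        frames = sample_frames(video, n_frames)
        # Draw several candidate answers for the same sparse input.
        answers = [model.generate(frames, question)
                   for _ in range(num_samples)]
        top_answer, count = Counter(answers).most_common(1)[0]
        # If the answers are consistent enough (or the frame budget
        # is exhausted), stop; otherwise densify and retry.
        if count / num_samples >= agree_ratio or n_frames >= max_frames:
            return top_answer
        n_frames *= 2
```

The consistency check acts as a confidence signal: easy questions terminate on sparse frames, while harder ones trigger denser sampling, which is how the strategy trades extra test-time compute for accuracy only where needed.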