We present \emph{Video-in-the-Loop} (ViTL), a two-stage framework for long-video QA that operates under a fixed token budget: it first \emph{localizes} question-relevant interval(s) with a low-fps skim and then \emph{answers} by reallocating visual tokens to those spans at a higher effective frame rate, emitting an interleaved output that contains both the spans and the final option for direct attribution. We also introduce \dataname{}, which converts description-based event graphs into \emph{span-grounded} multiple-choice QA by pairing each question with \emph{ground-truth} time span(s) and the associated reasoning. ViTL is trained end-to-end with an interleaved group-relative objective that couples temporal IoU for localization with answer correctness, allowing credit to flow from answers back to spans without additional compute. Under fixed token budgets, ViTL attains up to an 8.6\% performance gain while using 50\% fewer input frames on long-video QA and temporal grounding benchmarks (e.g., Charades-STA, ActivityNet-Captions), and ablations show that span-aware token reallocation consistently outperforms uniform sampling. Together, \dataname{} and ViTL provide an interpretable, compute-efficient recipe for scalable long-video QA.
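For concreteness, one way to instantiate the coupled objective described above is to score each sampled interleaved rollout with a weighted sum of temporal IoU and answer correctness and then normalize rewards within the group; the weights $\lambda_{\mathrm{loc}}$, $\lambda_{\mathrm{ans}}$ and the mean/std normalization below are illustrative assumptions rather than the exact formulation used in the paper:
\[
r_i \;=\; \lambda_{\mathrm{loc}}\,\mathrm{IoU}\!\left(\hat{s}_i, s^{\star}\right) \;+\; \lambda_{\mathrm{ans}}\,\mathbb{1}\!\left[\hat{a}_i = a^{\star}\right],
\qquad
A_i \;=\; \frac{r_i - \operatorname{mean}\!\left(\{r_j\}_{j=1}^{G}\right)}{\operatorname{std}\!\left(\{r_j\}_{j=1}^{G}\right)},
\]
where $\hat{s}_i$ and $\hat{a}_i$ denote the span(s) and option emitted in the $i$-th rollout, $s^{\star}$ and $a^{\star}$ the ground-truth span(s) and option, and $G$ the group size. Applying the group-relative advantage $A_i$ to the entire interleaved sequence is one way credit from answer correctness can propagate back to the localization tokens without extra compute.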