We present \emph{Video-in-the-Loop} (ViTL), a two-stage framework for long-video QA that preserves a fixed token budget: it first \emph{localizes} question-relevant interval(s) with a low-fps skim, then \emph{answers} by reallocating visual tokens to those spans at a higher effective frame rate, emitting an interleaved output that contains both the spans and the final option for direct attribution. We also introduce \dataname{}, which converts description-based event graphs into \emph{span-grounded} multiple-choice QA by pairing each question with \emph{ground-truth} time span(s) and the associated reasoning. ViTL is trained end-to-end with an interleaved group-relative objective that couples temporal IoU for localization with answer correctness, letting credit flow from answers back to spans without increasing compute. Under fixed token budgets, ViTL improves performance by up to 8.6\% while using 50\% fewer input frames on long-video QA and temporal grounding benchmarks (e.g., Charades-STA, ActivityNet-Captions), and ablations show that span-aware token reallocation consistently outperforms uniform sampling. Together, \dataname{} and ViTL provide an interpretable, compute-efficient recipe for scalable long-video QA.
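For concreteness, one plausible form of the interleaved group-relative objective is sketched below; the reward weights $\alpha, \beta$, the group size $G$, the predicted span $\hat{s}_g$ and option $\hat{a}_g$, and the references $s^{\ast}, a^{\ast}$ are assumed notation for illustration rather than definitions from the paper.
\begin{equation*}
r_g = \alpha\,\mathrm{IoU}\bigl(\hat{s}_g, s^{\ast}\bigr) + \beta\,\mathbb{1}\bigl[\hat{a}_g = a^{\ast}\bigr],
\qquad
A_g = \frac{r_g - \tfrac{1}{G}\sum_{j=1}^{G} r_j}{\operatorname{std}\bigl(\{r_j\}_{j=1}^{G}\bigr)},
\qquad
\mathcal{J}(\theta) = \mathbb{E}\Bigl[\tfrac{1}{G}\sum_{g=1}^{G} A_g \log \pi_{\theta}\bigl(o_g \mid v, q\bigr)\Bigr],
\end{equation*}
where each sampled output $o_g$ interleaves the predicted span(s) with the chosen option, so the shared group-relative advantage $A_g$ propagates answer credit back to the localized spans without additional compute.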