Temporal answer grounding in instructional video (TAGV) aims to locate the precise video segment that answers a natural language query, and it is increasingly important for direct video answer retrieval. The task remains challenging for two reasons: the questions are semantically complex, and there is a large length mismatch between untrimmed videos and short target moments. Existing methods are often sensitive to irrelevant content or lack sufficient visual reasoning capability. To address these limitations, we propose a Candidate-Aware Causal Reasoning (CACR) framework. Our approach first employs a Visual-Language Pre-training based Candidate Selection (VBCS) algorithm to efficiently generate K candidate segments. It then applies a temporal logic reasoning module to these candidates; the module is enhanced by a rejection reward mechanism and optimized via Group Relative Policy Optimization (GRPO) for robust inference. Extensive experiments on six benchmarks demonstrate that our method achieves state-of-the-art performance in terms of mean Intersection-over-Union (mIoU), providing a new perspective on reasoning-based retrieval in long videos.
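Since the abstract reports results in mean Intersection-over-Union (mIoU), a minimal sketch of how temporal IoU between a predicted and a ground-truth segment is commonly computed may help. This uses the standard definition and is not taken from the authors' evaluation code; the names `temporal_iou`, `mean_iou`, and the `(start, end)` segment representation in seconds are assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Assumed representation: a segment is (start_sec, end_sec) with start <= end.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-Union of two 1-D time intervals."""
    # Overlap length is zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Segment], gts: Sequence[Segment]) -> float:
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

For instance, a prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 seconds out of a 15-second union, so its IoU is 1/3; a prediction that misses the target moment entirely scores 0.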