Video temporal grounding (VTG) takes an untrimmed video and a natural-language query as input and localizes the temporal moment that best matches the query. Existing methods rely on large, task-specific datasets that require costly manual annotation. We introduce EvoGround, a framework of two coupled self-evolving agents, a proposer and a solver, that learn temporal grounding from raw videos without any human-labeled data. The proposer generates query-moment pairs from raw videos, while the solver learns to ground them and returns feedback signals that improve the proposer. Both agents are initialized from the same backbone and improve each other across iterations of this reinforcement-learning loop. Trained on 2.5K unlabeled videos, EvoGround matches or surpasses fully supervised models across multiple VTG benchmarks and also emerges as a state-of-the-art fine-grained video captioner, again without manual labels.