Video temporal grounding aims to localize relevant temporal boundaries in a video given a textual prompt. Recent work has focused on enabling Video LLMs to perform video temporal grounding via next-token prediction of temporal timestamps. However, accurately localizing timestamps in videos remains challenging for Video LLMs when relying solely on temporal token prediction. Our proposed TimeRefine addresses this challenge in two ways. First, instead of directly predicting the start and end timestamps, we reformulate the temporal grounding task as a temporal refining task: the model first makes rough predictions and then refines them by predicting offsets to the target segment. This refining process is repeated multiple times, through which the model progressively self-improves its temporal localization accuracy. Second, to enhance the model's temporal perception capabilities, we incorporate an auxiliary prediction head that penalizes the model more if a predicted segment deviates further from the ground truth, thus encouraging the model to make closer and more accurate predictions. Our plug-and-play method can be integrated into most LLM-based temporal grounding approaches. The experimental results demonstrate that TimeRefine achieves 3.6% and 5.0% mIoU improvements on the ActivityNet and Charades-STA datasets, respectively. Code and pretrained models will be released.
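The refinement loop and the deviation-scaled penalty described above can be illustrated with a minimal sketch. The numbers, function names, and offset values below are hypothetical (the paper's actual model architecture and training procedure are not reproduced here); the sketch only shows the shape of the idea: a coarse segment is corrected by successive predicted offsets, and an auxiliary L1-style penalty grows with the distance from the ground truth.

```python
def refine(coarse, offsets):
    """Apply a sequence of (start, end) offset corrections to a coarse segment.

    `coarse` is an initial (start, end) prediction in seconds; `offsets` is the
    sequence of per-step corrections the model would predict. Returns the full
    refinement trajectory, including the initial coarse prediction.
    """
    start, end = coarse
    history = [(start, end)]
    for d_start, d_end in offsets:
        start, end = start + d_start, end + d_end
        history.append((start, end))
    return history


def l1_penalty(pred, gt):
    """Auxiliary penalty: larger when the prediction deviates further from GT."""
    return abs(pred[0] - gt[0]) + abs(pred[1] - gt[1])


# Hypothetical example: ground-truth segment at 12.0-27.0 s, coarse guess
# 10.0-30.0 s, refined twice by model-predicted offsets.
gt = (12.0, 27.0)
steps = refine((10.0, 30.0), [(1.5, -2.0), (0.5, -1.0)])
penalties = [l1_penalty(s, gt) for s in steps]
# Each refinement step shrinks the penalty toward zero: 5.0 -> 1.5 -> 0.0
```

In training, the penalty would be computed per refinement step, so earlier, rougher predictions incur larger losses and the model is pushed to converge on the target segment.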