Video temporal grounding aims to localize relevant temporal boundaries in a video given a textual prompt. Recent work has focused on enabling Video LLMs to perform video temporal grounding via next-token prediction of temporal timestamps. However, accurately localizing timestamps in videos remains challenging for Video LLMs when relying solely on temporal token prediction. Our proposed TimeRefine addresses this challenge in two ways. First, instead of directly predicting the start and end timestamps, we reformulate the temporal grounding task as a temporal refining task: the model first makes rough predictions and then refines them by predicting offsets to the target segment. This refining process is repeated multiple times, through which the model progressively self-improves its temporal localization accuracy. Second, to enhance the model's temporal perception capabilities, we incorporate an auxiliary prediction head that penalizes the model more if a predicted segment deviates further from the ground truth, thus encouraging the model to make closer and more accurate predictions. Our plug-and-play method can be integrated into most LLM-based temporal grounding approaches. The experimental results demonstrate that TimeRefine achieves 3.6% and 5.0% mIoU improvements on the ActivityNet and Charades-STA datasets, respectively. Code and pretrained models will be released.
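The refinement loop and the deviation-scaled penalty described above can be illustrated with a minimal sketch. The numbers, function names, and offset values below are hypothetical (the paper's actual model architecture and training procedure are not reproduced here); the sketch only shows the shape of the idea: a coarse segment is corrected by successive predicted offsets, and an auxiliary L1-style penalty grows with the distance from the ground truth.

```python
def refine(coarse, offsets):
    """Apply a sequence of (start, end) offset corrections to a coarse segment.

    `coarse` is an initial (start, end) prediction in seconds; `offsets` is the
    sequence of per-step corrections the model would predict. Returns the full
    refinement trajectory, including the initial coarse prediction.
    """
    start, end = coarse
    history = [(start, end)]
    for d_start, d_end in offsets:
        start, end = start + d_start, end + d_end
        history.append((start, end))
    return history


def l1_penalty(pred, gt):
    """Auxiliary penalty: larger when the prediction deviates further from GT."""
    return abs(pred[0] - gt[0]) + abs(pred[1] - gt[1])


# Hypothetical example: ground-truth segment at 12.0-27.0 s, coarse guess
# 10.0-30.0 s, refined twice by model-predicted offsets.
gt = (12.0, 27.0)
steps = refine((10.0, 30.0), [(1.5, -2.0), (0.5, -1.0)])
penalties = [l1_penalty(s, gt) for s in steps]
# Each refinement step shrinks the penalty toward zero: 5.0 -> 1.5 -> 0.0
```

In training, the penalty would be computed per refinement step, so earlier, rougher predictions incur larger losses and the model is pushed to converge on the target segment.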