Video moment localization, also known as video moment retrieval, aims to find the target segment within a video that a given natural language query describes. Going beyond temporal action localization, in which the target actions are pre-defined, video moment retrieval can query arbitrary complex activities. In this survey, we present a comprehensive review of existing video moment localization techniques, covering supervised, weakly supervised, and unsupervised approaches. We also review the datasets available for video moment localization and group the results of related work. In addition, we discuss promising future directions for this field, in particular large-scale datasets and interpretable video moment localization models.