Temporal sentence localization in videos (TSLV) aims to retrieve, from an untrimmed video, the segment most relevant to a given sentence query. However, almost all existing TSLV approaches share two limitations: (1) they focus on either frame-level or object-level visual representation learning and the corresponding correlation reasoning, but fail to integrate the two; (2) they neglect the rich semantic contexts that could further benefit query reasoning. To address these issues, in this paper we propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN), which enables both visual- and semantic-aware query reasoning from the object level to the frame level. Specifically, we present a new graph memory mechanism to perform visual-semantic query reasoning. For visual reasoning, we design a visual graph memory to leverage the visual information of the video. For semantic reasoning, we introduce a semantic graph memory that explicitly leverages the semantic knowledge contained in the classes and attributes of video objects and performs correlation reasoning in the semantic space. Experiments on three datasets demonstrate that HVSARN achieves new state-of-the-art performance.