Temporal sentence grounding (TSG) aims to localize the temporal segment that is semantically aligned with a natural language query in an untrimmed video. Most existing methods follow a conventional TSG framework and extract frame-grained or object-grained features with a 3D ConvNet or a detection network, and therefore fail to capture the subtle differences between frames or to model the spatio-temporal behavior of core persons/objects. In this paper, we introduce a new perspective that addresses the TSG task by tracking pivotal objects and activities to learn more fine-grained spatio-temporal behaviors. Specifically, we propose a novel Temporal Sentence Tracking Network (TSTNet), which contains (A) a Cross-modal Targets Generator that generates multi-modal templates and a search space while filtering objects and activities, and (B) a Temporal Sentence Tracker that tracks the multi-modal targets to model their behavior and to predict the query-related segment. We conduct extensive experiments and comparisons with state-of-the-art methods on two challenging benchmarks, Charades-STA and TACoS. TSTNet achieves leading performance while running at real-time speed.