Most existing video moment retrieval methods rely on temporal sequences of frame- or clip-level features that primarily encode global visual and semantic information. However, such representations often fail to capture fine-grained object semantics and appearance, which are crucial for localizing moments described by object-oriented queries involving specific entities and their interactions. In particular, temporal dynamics at the object level have been largely overlooked, limiting the effectiveness of existing approaches in scenarios that require detailed object-level reasoning. To address this limitation, we propose a novel object-centric framework for moment retrieval. Our method first extracts query-relevant objects with a scene graph parser and then generates scene graphs from video frames to represent these objects and their relationships. From the scene graphs, we construct object-level feature sequences that encode rich visual and semantic information. These sequences are processed by a relational tracklet transformer, which models spatio-temporal correlations among objects across time. By explicitly capturing object-level state changes, our framework enables more accurate localization of moments aligned with object-oriented queries. We evaluate our method on three benchmarks: Charades-STA, QVHighlights, and TACoS. Experimental results demonstrate that our method outperforms existing state-of-the-art methods on all three benchmarks.
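As a rough illustration only (not the authors' implementation, whose details are not given in the abstract), the sketch below shows one way a "relational tracklet transformer" could model spatio-temporal correlations among object-level features: per-object feature sequences of shape [num_objects, num_frames, feat_dim] are flattened into a joint token sequence so that self-attention spans both objects and time. All module names, dimensions, and hyperparameters here are hypothetical.

```python
# Minimal sketch, assuming object tracklet features have already been extracted
# (e.g., from scene-graph nodes); shapes and layer sizes are illustrative only.
import torch
import torch.nn as nn

class RelationalTrackletTransformer(nn.Module):
    def __init__(self, feat_dim=256, num_heads=8, num_layers=4):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

    def forward(self, tracklet_feats):
        # tracklet_feats: [batch, num_objects, num_frames, feat_dim]
        b, o, t, d = tracklet_feats.shape
        # Flatten the object and time axes so attention spans both dimensions,
        # letting the encoder relate different objects at different time steps.
        tokens = tracklet_feats.reshape(b, o * t, d)
        return self.encoder(tokens).reshape(b, o, t, d)

# Toy usage: 2 query-relevant objects tracked over 16 frames.
model = RelationalTrackletTransformer()
feats = torch.randn(1, 2, 16, 256)
out = model(feats)  # [1, 2, 16, 256]
```

A real system would additionally fuse these contextualized object features with the query representation and feed them to a moment-localization head; those components are omitted here.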