Visual Query Localization (VQL) on long-form egocentric videos requires spatio-temporal search and localization of visually specified objects and is vital for building episodic memory systems. Prior work develops complex multi-stage pipelines that leverage well-established object detection and tracking methods to perform VQL. However, each stage is trained independently, and the complexity of the pipeline results in slow inference. We propose VQLoC, a novel single-stage VQL framework that is end-to-end trainable. Our key idea is to first build a holistic understanding of the query-video relationship and then perform spatio-temporal localization in a single-shot manner. Specifically, we establish the query-video relationship by jointly considering query-to-frame correspondences between the query and each video frame and frame-to-frame correspondences between nearby video frames. Our experiments demonstrate that our approach outperforms prior VQL methods by 20% in accuracy while obtaining a 10x improvement in inference speed. VQLoC is also the top entry on the Ego4D VQ2D challenge leaderboard. Project page: https://hwjiang1510.github.io/VQLoC/
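
To make the two kinds of correspondence concrete, the following is a minimal NumPy sketch, not the VQLoC implementation. It assumes that query-to-frame correspondence can be modeled as cross-attention from frame tokens to query tokens, and that frame-to-frame correspondence can be modeled as self-attention restricted to a temporal window. The function names, tensor shapes, and the `window` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive zero weight.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def query_to_frame(frame_feats, query_feats):
    """Assumed cross-attention: every frame token attends to the query tokens.

    frame_feats: (T, N, D) features for T frames with N tokens each.
    query_feats: (M, D) features for the visual query crop.
    Returns query-conditioned frame features of shape (T, N, D).
    """
    d = frame_feats.shape[-1]
    scores = frame_feats @ query_feats.T / np.sqrt(d)  # (T, N, M)
    attn = softmax(scores, axis=-1)
    return frame_feats + attn @ query_feats  # residual update, (T, N, D)


def frame_to_frame(feats, window=2):
    """Assumed local self-attention: tokens attend only to nearby frames.

    A token in frame t may attend to tokens in frame s when |t - s| <= window.
    Returns the refined features (T, N, D) and the attention map (T*N, T*N).
    """
    T, N, D = feats.shape
    tokens = feats.reshape(T * N, D)
    scores = tokens @ tokens.T / np.sqrt(D)
    frame_idx = np.repeat(np.arange(T), N)  # frame index of each token
    nearby = np.abs(frame_idx[:, None] - frame_idx[None, :]) <= window
    scores = np.where(nearby, scores, -np.inf)  # mask out distant frames
    attn = softmax(scores, axis=-1)
    return (attn @ tokens).reshape(T, N, D), attn


# Toy inputs: 6 frames, 4 tokens per frame, 3 query tokens, 8 channels.
rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 4, 8))
query = rng.normal(size=(3, 8))

fused = query_to_frame(frames, query)
refined, attn = frame_to_frame(fused, window=1)
```

In a single-stage design of this kind, a prediction head would then read per-frame boxes and occurrence scores directly from `refined`. That head is omitted here because the text above does not specify it.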