With recent advances in video and 3D understanding, novel 4D spatio-temporal methods that fuse the two have emerged. In this direction, the Ego4D Episodic Memory Benchmark proposed the task of Visual Queries with 3D Localization (VQ3D). Given an egocentric video clip and an image crop depicting a query object, the goal is to localize the 3D position of the center of that query object with respect to the camera pose of a query frame. Current methods tackle VQ3D by unprojecting the 2D localization results of the sibling task, Visual Queries with 2D Localization (VQ2D), into 3D predictions. Yet, we point out that the low number of camera poses resulting from the camera re-localization used in previous VQ3D methods severely hinders their overall success rate. In this work, we formalize a pipeline, which we dub EgoLoc, that better entangles 3D multiview geometry with 2D object retrieval from egocentric videos. Our approach estimates more robust camera poses and aggregates multi-view 3D displacements by leveraging the 2D detection confidence, which raises the success rate of object queries and significantly improves on the VQ3D baseline. Specifically, our approach achieves an overall success rate of up to 87.12%, setting a new state of the art on the VQ3D task. We provide a comprehensive empirical analysis of the VQ3D task and existing solutions, and highlight the remaining challenges. The code is available at https://github.com/Wayne-Mai/EgoLoc.
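The unproject-then-aggregate idea described in the abstract can be illustrated with a minimal sketch. This is not the EgoLoc implementation: the abstract does not specify the weighting scheme, so a normalized confidence-weighted mean is assumed here, along with a pinhole camera with intrinsics `K`, a known metric depth at each detection center, and 4x4 camera-to-world poses. All function names and the detection tuple layout are hypothetical.

```python
import numpy as np


def unproject(u, v, depth, K):
    """Back-project pixel (u, v) with metric depth into the camera frame (pinhole model)."""
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.array([x, y, depth], dtype=float)


def cam_to_world(p_cam, T_wc):
    """Map a camera-frame point to the world frame using a 4x4 camera-to-world pose."""
    return T_wc[:3, :3] @ p_cam + T_wc[:3, 3]


def world_to_cam(p_world, T_wc):
    """Map a world-frame point into the camera frame whose camera-to-world pose is T_wc."""
    return T_wc[:3, :3].T @ (p_world - T_wc[:3, 3])


def aggregate(detections, K, T_query):
    """Confidence-weighted mean of per-view 3D estimates, expressed in the query frame.

    detections: list of (u, v, depth, score, T_wc) tuples (hypothetical layout),
    one per frame in which the 2D detector fired and a camera pose is available.
    """
    points = np.stack(
        [cam_to_world(unproject(u, v, d, K), T) for u, v, d, _, T in detections]
    )
    scores = np.array([s for _, _, _, s, _ in detections], dtype=float)
    if scores.sum() <= 0:
        raise ValueError("detection scores must have a positive sum")
    weights = scores / scores.sum()  # assumed weighting: normalized 2D confidence
    p_world = weights @ points       # weighted average in the world frame
    return world_to_cam(p_world, T_query)


# Toy example: two views of an object at world position (0, 0, 2).
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
T_a = np.eye(4)                      # first camera at the world origin
T_b = np.eye(4)
T_b[0, 3] = 1.0                      # second camera shifted 1 unit along x
dets = [(50.0, 50.0, 2.0, 0.9, T_a), (0.0, 50.0, 2.0, 0.6, T_b)]
displacement = aggregate(dets, K, T_query=T_a)
```

Because a rigid transform is affine and the weights sum to one, averaging in the world frame and then mapping to the query frame gives the same result as averaging per-view displacements directly in the query frame. The sketch also makes the abstract's central point concrete: every frame without a recovered camera pose drops out of `detections`, so fewer poses mean fewer views to aggregate.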