Pairwise pose estimation from images with little or no overlap is an open challenge in computer vision. Existing methods, even those trained on large-scale datasets, struggle in these scenarios due to the lack of identifiable correspondences or visual overlap. Inspired by the human ability to infer spatial relationships from diverse scenes, we propose InterPose, a novel approach that leverages the rich priors encoded within pre-trained generative video models. Specifically, we use a video model to hallucinate intermediate frames between two input images, effectively creating a dense visual transition that significantly simplifies the problem of pose estimation. Since current video models can still produce implausible motion or inconsistent geometry, we introduce a self-consistency score that measures the agreement among pose predictions derived from independently sampled videos. We demonstrate that our approach generalizes across three state-of-the-art video models and show consistent improvements over the state-of-the-art DUSt3R on four diverse datasets encompassing indoor, outdoor, and object-centric scenes. Our findings suggest a promising avenue for improving pose estimation models by leveraging large generative models trained on vast amounts of video data, which is more readily available than 3D data. See our project page for results: https://inter-pose.github.io/.
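To make the pipeline described above concrete, the following is a minimal sketch of the sample-then-score idea: hallucinate several candidate videos between the two input images, chain adjacent-frame relative poses along each video, and keep the prediction that agrees most with the others. The helper names (`sample_intermediate_video`, `estimate_relative_pose`) and the particular consistency measure (mean pairwise rotation geodesic distance) are illustrative assumptions, not the paper's exact formulation; in practice the frame interpolation would come from a pre-trained video model and the pairwise poses from a model such as DUSt3R.

```python
import numpy as np

# Hypothetical stand-ins for the components named in the abstract.
def sample_intermediate_video(img_a, img_b, num_frames, seed):
    """Hallucinate num_frames frames transitioning from img_a to img_b
    (e.g., with a pre-trained generative video model)."""
    raise NotImplementedError

def estimate_relative_pose(frame1, frame2):
    """Return a 4x4 relative camera pose between two overlapping frames
    (e.g., from DUSt3R on an adjacent frame pair)."""
    raise NotImplementedError

def chain_poses(frames):
    """Compose adjacent-frame relative poses into one end-to-end pose."""
    pose = np.eye(4)
    for f1, f2 in zip(frames[:-1], frames[1:]):
        pose = estimate_relative_pose(f1, f2) @ pose
    return pose

def rotation_distance(pose1, pose2):
    """Geodesic angle (radians) between the rotation parts of two poses."""
    r = pose1[:3, :3].T @ pose2[:3, :3]
    return np.arccos(np.clip((np.trace(r) - 1.0) / 2.0, -1.0, 1.0))

def interpose(img_a, img_b, num_samples=8, num_frames=16):
    # Each sampled video yields one candidate end-to-end pose.
    poses = []
    for seed in range(num_samples):
        frames = sample_intermediate_video(img_a, img_b, num_frames, seed)
        poses.append(chain_poses([img_a, *frames, img_b]))
    # Self-consistency score (one plausible instantiation): prefer the
    # candidate whose rotation agrees most with the other samples.
    scores = [-np.mean([rotation_distance(p, q) for q in poses])
              for p in poses]
    return poses[int(np.argmax(scores))]
```

This scoring step assumes that plausible videos produce pose predictions that cluster together, while implausible motion or inconsistent geometry yields outliers that are voted down.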