Transformers are powerful visual learners, in large part due to their conspicuous lack of manually specified priors. This flexibility can be problematic in tasks that involve multiple-view geometry: the near-infinite variation in 3D shapes and viewpoints demands flexibility, while projective geometry obeys rigid laws. To resolve this conundrum, we propose a "light touch" approach that guides visual Transformers to learn multiple-view geometry but allows them to break free when needed. We achieve this by using epipolar lines to guide the Transformer's cross-attention maps, penalizing attention values outside the epipolar lines and encouraging higher attention along them, since these lines contain the geometrically plausible matches. Unlike previous methods, our proposal does not require any camera pose information at test time. We focus on pose-invariant object instance retrieval, where standard Transformer networks struggle because of the large differences in viewpoint between query and retrieved images. Experimentally, our method outperforms state-of-the-art approaches at object retrieval, without needing pose information at test time.
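To make the epipolar guidance concrete, the following is a minimal NumPy sketch of one way such a penalty could be written. It is an illustration under stated assumptions, not the implementation described here: the abstract does not specify the loss, so a binary cross-entropy between the attention map and an epipolar mask is assumed, and the names `epipolar_mask`, `epipolar_attention_loss`, and the `thickness` parameter are hypothetical. The fundamental matrix `F` is assumed to be available during training only.

```python
import numpy as np


def epipolar_mask(F, query_xy, height, width, thickness=1.0):
    """Mark target pixels lying within `thickness` pixels of an epipolar line.

    F is a 3x3 fundamental matrix with the convention x_target^T F x_query = 0,
    so the epipolar line of a query pixel in the target image is l = F x_query.
    """
    x = np.array([query_xy[0], query_xy[1], 1.0])
    a, b, c = F @ x  # line a*u + b*v + c = 0 in the target image
    us, vs = np.meshgrid(np.arange(width), np.arange(height))
    # Perpendicular distance from every target pixel to the line.
    dist = np.abs(a * us + b * vs + c) / (np.hypot(a, b) + 1e-12)
    return (dist <= thickness).astype(np.float64)


def epipolar_attention_loss(attn, mask, eps=1e-7):
    """Assumed loss: binary cross-entropy between attention and the mask.

    Attention values in [0, 1] are pushed up on the epipolar line and
    pushed down everywhere else.
    """
    p = np.clip(attn, eps, 1.0 - eps)
    return float(-np.mean(mask * np.log(p) + (1.0 - mask) * np.log(1.0 - p)))


# Toy case: a camera translated along the x-axis (rectified stereo), for which
# the epipolar line of a query pixel is the same image row in the target view.
F_rectified = np.array([[0.0, 0.0, 0.0],
                        [0.0, 0.0, -1.0],
                        [0.0, 1.0, 0.0]])
mask = epipolar_mask(F_rectified, query_xy=(3, 2), height=4, width=6, thickness=0.5)

attn_on_line = 0.05 + 0.9 * mask   # attention concentrated on the epipolar line
attn_off_line = 0.95 - 0.9 * mask  # attention concentrated away from it

loss_on_line = epipolar_attention_loss(attn_on_line, mask)
loss_off_line = epipolar_attention_loss(attn_off_line, mask)
```

In this sketch the mask, and therefore the camera geometry, is needed only to compute the training penalty. At test time the attention maps would be used as learned, which is consistent with the claim that no pose information is required at inference. Because the term is a soft penalty rather than a hard constraint, the network can still place attention off the line when the data call for it.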