We study positional encodings for multi-view transformers that process tokens from a set of posed input images, seeking a mechanism that encodes patches uniquely, allows SE(3)-invariant attention with multi-frequency similarity, and adapts to the geometry of the underlying scene. We find that prior (absolute or relative) encoding schemes for multi-view attention do not meet these desiderata, and present RayRoPE to address this gap. RayRoPE represents patch positions via their associated rays, but leverages a predicted point along each ray, rather than the ray direction, for a geometry-aware encoding. To achieve SE(3) invariance, RayRoPE expresses these points in query-frame projective coordinates when computing multi-frequency similarity. Lastly, since the 'predicted' 3D point along a ray may be imprecise, RayRoPE introduces a mechanism to analytically compute the expected position encoding under uncertainty. We validate RayRoPE on novel-view synthesis and stereo depth estimation, showing that it consistently improves over alternative position encoding schemes (e.g., a 15% relative improvement in LPIPS on CO3D). We also show that RayRoPE can seamlessly incorporate RGB-D input, yielding even larger gains over alternatives that cannot positionally encode this information.
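To make the multi-frequency, relative-position flavor of the mechanism concrete, the sketch below shows a standard rotary-style (RoPE) encoding applied per-coordinate to a 3D point, e.g. a point expressed in the query camera's frame. This is an illustrative sketch of the general RoPE construction, not the authors' implementation; the function names and channel split are hypothetical.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by multi-frequency angles
    derived from a scalar position `pos` (standard RoPE construction)."""
    d = x.shape[-1]
    assert d % 2 == 0
    # geometric ladder of frequencies, high to low
    freqs = base ** (-np.arange(d // 2) / (d // 2))
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, point):
    """Hypothetical extension: split channels into three groups and rotate
    each group by one coordinate of a 3D point."""
    d = x.shape[-1] // 3
    return np.concatenate(
        [rope_1d(x[..., i * d:(i + 1) * d], point[i]) for i in range(3)],
        axis=-1,
    )

# Key RoPE property: the attention logit depends only on the relative
# offset between the two encoded positions, not their absolute values.
rng = np.random.default_rng(0)
q, k = rng.normal(size=12), rng.normal(size=12)
p_q, p_k = np.array([0.3, 1.2, -0.5]), np.array([1.0, 0.2, 0.4])
a = rope_3d(q, p_q) @ rope_3d(k, p_k)
b = rope_3d(q, p_q + 1.0) @ rope_3d(k, p_k + 1.0)  # shift both points
assert np.allclose(a, b)
```

The shift-invariance check at the end is what motivates expressing points in the query camera's frame: once both positions live in a common frame, the similarity depends only on their relative geometry.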