Because transformers are equivariant to permutations of their input tokens, encoding the positional information of tokens is necessary for many tasks. However, existing positional encoding schemes were originally designed for NLP tasks, so their suitability for vision tasks, whose data typically exhibit different structural properties, is questionable. We argue that existing positional encoding schemes are suboptimal for 3D vision tasks because they do not respect the underlying 3D geometric structure of the data. Based on this hypothesis, we propose a geometry-aware attention mechanism that encodes the geometric structure of tokens as a relative transformation determined by the geometric relationship between queries and key-value pairs. Evaluating on multiple novel view synthesis (NVS) datasets in the sparse wide-baseline multi-view setting, we show that our attention mechanism, called Geometric Transform Attention (GTA), improves the learning efficiency and performance of state-of-the-art transformer-based NVS models without any additional learned parameters and with only minor computational overhead.
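To make the two technical claims in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It first shows that plain self-attention is permutation-equivariant. It then shows one possible way to make attention depend only on relative transformations between query and key-value tokens. The function names (`attention`, `rot2d`, `relative_transform_attention`), the use of 2D rotation matrices as per-token transforms, and the specific choice of applying each token's transform to its query, key, value, and output are all assumptions made for illustration; the abstract states only that the geometric structure is encoded as a relative transformation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Plain scaled dot-product attention; q, k, v have shape (n, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def rot2d(theta):
    # 2x2 rotation matrix, used here as a toy per-token transform (assumption).
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


def relative_transform_attention(q, k, v, reps):
    # reps has shape (n, d, d): one invertible matrix R_i per token (hypothetical).
    # Scores become q_i^T (R_i R_j^{-1}) k_j and values become (R_i R_j^{-1}) v_j,
    # so the output depends only on the relative transforms R_i R_j^{-1}.
    inv = np.linalg.inv(reps)
    q_t = np.einsum("nji,nj->ni", reps, q)  # R_i^T q_i
    k_t = np.einsum("nij,nj->ni", inv, k)   # R_j^{-1} k_j
    v_t = np.einsum("nij,nj->ni", inv, v)   # R_j^{-1} v_j
    out = attention(q_t, k_t, v_t)
    return np.einsum("nij,nj->ni", reps, out)  # R_i o_i


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 5, 2
    q, k, v = rng.normal(size=(3, n, d))

    # Permuting the tokens permutes the outputs in the same way.
    perm = rng.permutation(n)
    same = np.allclose(attention(q[perm], k[perm], v[perm]), attention(q, k, v)[perm])
    print("permutation-equivariant:", same)

    # Composing every token's transform with one shared transform changes nothing.
    reps = np.stack([rot2d(a) for a in rng.uniform(0.0, 2.0 * np.pi, size=n)])
    shared = rot2d(0.7)
    base = relative_transform_attention(q, k, v, reps)
    moved = relative_transform_attention(q, k, v, reps @ shared)
    print("depends only on relative transforms:", np.allclose(base, moved))
```

In this sketch the transforms are fixed matrices rather than learned weights, which is consistent with the abstract's statement that no learned parameters are added; the extra cost is a handful of small matrix products per token.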