We present a novel bi-directional Transformer architecture (BiXT) whose computational cost and memory consumption scale linearly with input size, but which avoids both the drop in performance and the restriction to a single input modality seen with other efficient Transformer-based approaches. BiXT is inspired by the Perceiver architectures but replaces iterative attention with an efficient bi-directional cross-attention module in which input tokens and latent variables attend to each other simultaneously, leveraging a naturally emerging attention symmetry between the two. This approach removes a key bottleneck of Perceiver-like architectures and lets the processing and interpretation of both semantics ('what') and location ('where') develop alongside each other over multiple layers, allowing direct application to dense and instance-based tasks alike. By combining efficiency with the generality and performance of a full Transformer architecture, BiXT can process longer sequences such as point clouds or images at higher feature resolutions, and achieves competitive performance across a range of tasks including point cloud part segmentation, semantic image segmentation and image classification.
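
The bi-directional cross-attention described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes that the attention symmetry is exploited by computing a single latent-token similarity matrix and normalising it along each axis in turn, and it omits multiple heads, normalisation layers and the rest of the block. All names (`bidirectional_cross_attention`, `w_lat_r`, `w_tok_v`, and so on) are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def bidirectional_cross_attention(latents, tokens, w_lat_r, w_lat_v, w_tok_r, w_tok_v):
    """Single-head sketch: latents (M, D) and tokens (N, D) update each other.

    Assumption (not stated in the abstract): one shared similarity matrix
    serves both attention directions.
    """
    r_lat = latents @ w_lat_r  # (M, D) reference vectors for latents
    r_tok = tokens @ w_tok_r   # (N, D) reference vectors for tokens
    v_lat = latents @ w_lat_v  # (M, D) values offered by latents
    v_tok = tokens @ w_tok_v   # (N, D) values offered by tokens

    # One (M, N) similarity matrix, computed once and reused in both directions.
    sim = r_lat @ r_tok.T / np.sqrt(r_lat.shape[-1])

    # Latents gather from tokens: normalise over the token axis.
    lat_update = softmax(sim, axis=1) @ v_tok    # (M, D)
    # Tokens gather from latents: normalise over the latent axis.
    tok_update = softmax(sim.T, axis=1) @ v_lat  # (N, D)

    # Residual updates keep both streams at their original shapes.
    return latents + lat_update, tokens + tok_update
```

With a fixed number of latents M, the similarity matrix has M x N entries, so the cost of this step grows linearly with the number of input tokens N, in contrast to the N x N matrix of full self-attention.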