Despite lagging behind their counterparts in other modalities in many respects, Vision Transformers offer an interesting opportunity to bridge the gap between sequence modeling and image modeling. Until now, however, Vision Transformers have largely been held back by both computational inefficiency and a lack of proper handling of spatial dimensions. In this paper, we introduce the Cross-Axis Transformer (CAT). CAT is a model inspired by both Axial Transformers and Microsoft's recent Retentive Network that drastically reduces the number of floating point operations required to process an image, while simultaneously converging faster and more accurately than the Vision Transformers it replaces.