Despite lagging behind their counterparts in other modalities in many respects, Vision Transformers offer an interesting opportunity to bridge the gap between sequence modeling and image modeling. Until now, however, Vision Transformers have largely been held back by both computational inefficiency and a lack of proper handling of spatial dimensions. In this paper, we introduce the Cross-Axis Transformer (CAT), a model inspired by both Axial Transformers and Microsoft's recent Retentive Network. CAT drastically reduces the number of floating point operations required to process an image, while simultaneously converging faster and more accurately than the Vision Transformers it replaces.