Owing to advances in deep learning, Vision Transformers (ViTs) have demonstrated impressive performance in various computer vision tasks. Nonetheless, ViTs still face challenges such as high computational complexity and the absence of desirable inductive biases. To alleviate these issues, we explore the potential advantages of combining eagle vision with ViTs. Inspired by the unique physiological and visual characteristics of eagle eyes, we distill a Bi-Fovea Visual Interaction (BFVI) structure. Based on this design, we propose a novel Bi-Fovea Self-Attention (BFSA) mechanism and a Bi-Fovea Feedforward Network (BFFN), which mimic the hierarchical and parallel information processing of the biological visual cortex and enable networks to learn feature representations of targets in a coarse-to-fine manner. Furthermore, a Bionic Eagle Vision (BEV) block is designed as the basic building unit from the BFSA mechanism and the BFFN. By stacking BEV blocks, we develop a unified and efficient family of pyramid backbone networks called Eagle Vision Transformers (EViTs). Experimental results show that EViTs achieve highly competitive performance in various computer vision tasks, including image classification, object detection, and semantic segmentation. Compared with other approaches, EViTs offer significant advantages in performance and computational efficiency. Code is available at https://github.com/nkusyl/EViT
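To make the coarse-to-fine, two-branch idea concrete, the following is a minimal NumPy sketch, not the paper's actual BFSA/BFFN: a hypothetical `bi_fovea_block` in which a coarse branch attends over average-pooled tokens and a fine branch refines the result with full-resolution attention. All function names, the pooling scheme, and the way the branches are combined are illustrative assumptions; the released code at the repository above defines the real architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention over token matrices
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def bi_fovea_block(x, pool=4):
    """Hypothetical coarse-to-fine block (illustrative only).

    Coarse branch: tokens are average-pooled into a small set of
    summary tokens that every query attends to (cheap, global view).
    Fine branch: full-resolution self-attention refines the coarse
    output (expensive, local detail).
    """
    n, d = x.shape
    # coarse branch: pool groups of `pool` tokens into summary tokens
    coarse_tokens = x.reshape(n // pool, pool, d).mean(axis=1)
    coarse = attention(x, coarse_tokens, coarse_tokens)
    # fine branch: full-resolution attention over the coarse result
    fine = attention(coarse, coarse, coarse)
    return x + fine  # residual connection

x = np.random.randn(16, 8)
y = bi_fovea_block(x)
print(y.shape)  # (16, 8)
```

Stacking such blocks, with downsampling between stages, would yield a pyramid backbone in the same spirit as the EViT family described above, though the real BEV block also includes the BFFN and other components not sketched here.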