Accurate localization of the fovea is a crucial initial step in analyzing retinal diseases, since it helps prevent irreversible vision loss. Although current deep learning-based methods outperform traditional methods, they still face challenges such as inadequate use of anatomical landmarks and sensitivity to diseased retinal images and to varying image conditions. In this paper, we propose a novel transformer-based architecture (Bilateral-Fuser) for multi-cue fusion. The Bilateral-Fuser explicitly incorporates long-range connections and global features using retina and vessel distributions to achieve robust fovea localization. We introduce a spatial attention mechanism in the dual-stream encoder to extract and fuse self-learned anatomical information. This design focuses more on features distributed along blood vessels and significantly reduces computational cost by reducing the number of tokens. Our comprehensive experiments demonstrate that the proposed architecture achieves state-of-the-art performance on two public datasets and one large-scale private dataset. Moreover, we show that the Bilateral-Fuser is more robust on both normal and diseased retinal images and generalizes better in cross-dataset experiments.
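The abstract states that spatial attention concentrates on vessel-related features and lowers cost by reducing the number of tokens. The following is a minimal sketch of that general idea, not the Bilateral-Fuser implementation. It assumes that token reduction is done by keeping the highest-scoring tokens (a top-k rule chosen here for illustration) and that fusion is plain single-head dot-product attention without learned projections. The names `select_tokens`, `cross_attend`, and `attention_pairs` are hypothetical.

```python
import math


def select_tokens(tokens, scores, keep):
    """Keep the `keep` tokens with the highest spatial-attention score.

    Assumption: top-k selection stands in for whatever reduction the
    actual architecture uses. Original token order is preserved.
    """
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(ranked[:keep])
    return [tokens[i] for i in kept_idx], kept_idx


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head scaled dot-product attention (no learned weights).

    Each query token is replaced by an attention-weighted average of the
    tokens in `keys_values`, e.g. image tokens attending to vessel tokens.
    """
    dim = len(queries[0])
    fused = []
    for q in queries:
        logits = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(logits)
        fused.append([
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ])
    return fused


def attention_pairs(n_tokens):
    """Number of query-key pairs scored by full self-attention."""
    return n_tokens * n_tokens


# Toy example: four 2-D tokens with made-up attention scores.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
vessel_scores = [0.1, 0.9, 0.3, 0.7]

kept, kept_idx = select_tokens(image_tokens, vessel_scores, keep=2)
fused = cross_attend(kept, kept)
```

Because attention scores every query-key pair, the cost grows quadratically with the token count: in this sketch, reducing 64 tokens to 16 cuts the number of scored pairs by a factor of 16.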