Accurate localization of the fovea is one of the primary steps in analyzing retinal diseases, since it helps prevent irreversible vision loss. Although current deep learning-based methods outperform traditional methods, challenges remain, such as insufficient use of anatomical landmarks and sensitivity to diseased retinal images and varying image conditions. In this paper, we propose a novel transformer-based architecture (Bilateral-Fuser) for multi-cue fusion. This architecture explicitly incorporates long-range connections and global features using retina and vessel distributions for robust fovea localization. We introduce a spatial attention mechanism in the dual-stream encoder to extract and fuse self-learned anatomical information. This design focuses more on features distributed along blood vessels and significantly decreases computational cost by reducing the number of tokens. Our comprehensive experiments show that the proposed architecture achieves state-of-the-art performance on two public datasets and one large-scale private dataset. We also show that Bilateral-Fuser is more robust on both normal and diseased retinal images and generalizes better in cross-dataset experiments.
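
The token-reduction idea can be illustrated with a minimal sketch. This is not the Bilateral-Fuser implementation: it assumes, purely for illustration, that a per-token spatial attention score is already available and that reduction means keeping only the highest-scoring tokens. The function name `select_tokens_by_attention`, the random inputs, and all shapes are hypothetical.

```python
import numpy as np


def select_tokens_by_attention(features, attention, keep):
    """Keep the `keep` tokens with the highest attention scores.

    features:  (N, C) array, one row per spatial token (hypothetical layout).
    attention: (N,) array of non-negative scores, e.g. a vessel-focused map.
    Returns the re-weighted kept tokens and their original indices.
    """
    n_tokens = features.shape[0]
    if not 0 < keep <= n_tokens:
        raise ValueError("keep must be between 1 and the number of tokens")
    # Indices of the top-scoring tokens, restored to spatial order.
    top = np.sort(np.argsort(attention)[-keep:])
    # Scale each kept token by its score so stronger responses dominate.
    return features[top] * attention[top, None], top


# Assumed toy input: an 8x8 feature map with 16 channels, flattened to 64 tokens.
rng = np.random.default_rng(0)
features = rng.normal(size=(8 * 8, 16))
attention = rng.random(8 * 8)  # stand-in for a learned spatial attention map

tokens, idx = select_tokens_by_attention(features, attention, keep=16)
print(tokens.shape)  # (16, 16)
```

Because standard self-attention compares every pair of tokens, keeping 16 of 64 tokens in this toy setting cuts the pairwise comparisons from 4096 to 256.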