Accurate fovea localization is essential for analyzing retinal diseases to prevent irreversible vision loss. While current deep learning-based methods outperform traditional ones, they still face challenges such as the lack of local anatomical landmarks around the fovea, the inability to robustly handle diseased retinal images, and variations in image conditions. In this paper, we propose a novel transformer-based architecture called DualStreamFoveaNet (DSFN) for multi-cue fusion. This architecture explicitly incorporates long-range connections and global features using retina and vessel distributions for robust fovea localization. We introduce a spatial attention mechanism in the dual-stream encoder to extract and fuse self-learned anatomical information. This mechanism focuses more on features distributed along blood vessels and significantly lowers computational cost by reducing the number of tokens. Our extensive experiments show that the proposed architecture achieves state-of-the-art performance on two public datasets and one large-scale private dataset. Furthermore, we demonstrate that DSFN is more robust on both normal and diseased retinal images and has better generalization capacity in cross-dataset experiments.
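To make the link between token count and computational cost concrete, the following minimal sketch shows why keeping fewer tokens helps: full self-attention scores every query-key pair, so its cost grows quadratically with the number of tokens. This is not the DSFN implementation. The abstract does not say how tokens are reduced, so the top-k selection rule, the score values, and the helper names `attention_pair_count` and `keep_top_tokens` are all assumptions made for illustration.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Number of query-key pairs scored by one full self-attention layer."""
    return num_tokens * num_tokens


def keep_top_tokens(scores, keep):
    """Return indices of the `keep` highest-scoring tokens, in original order.

    Assumption: tokens are ranked by a per-token spatial attention score.
    The actual selection rule used by DSFN is not given in the abstract.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical scores for 8 patch tokens (higher = closer to vessel-like features).
scores = [0.05, 0.90, 0.10, 0.75, 0.20, 0.60, 0.15, 0.85]

kept = keep_top_tokens(scores, keep=4)
full_cost = attention_pair_count(len(scores))
reduced_cost = attention_pair_count(len(kept))

print("kept token indices:", kept)
print("attention pairs, all tokens:", full_cost)
print("attention pairs, kept tokens:", reduced_cost)
```

In this toy setting, halving the number of tokens cuts the number of attention pairs to a quarter, which is the general reason token reduction can lower the cost of transformer encoders.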