Accurate fovea localization is essential for analyzing retinal diseases and preventing irreversible vision loss. While current deep learning-based methods outperform traditional ones, they still face challenges such as the scarcity of local anatomical landmarks around the fovea, the inability to robustly handle diseased retinal images, and variable imaging conditions. In this paper, we propose a novel transformer-based architecture called DualStreamFoveaNet (DSFN) for multi-cue fusion. The architecture explicitly incorporates long-range connections and global features from both retina and vessel distributions for robust fovea localization. We introduce a spatial attention mechanism in the dual-stream encoder that extracts and fuses self-learned anatomical information, attending preferentially to features distributed along blood vessels while significantly reducing computational cost by decreasing the number of tokens. Extensive experiments show that the proposed architecture achieves state-of-the-art performance on two public datasets and one large-scale private dataset. Furthermore, we demonstrate that DSFN is more robust on both normal and diseased retinal images and generalizes better in cross-dataset experiments.
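To make the token-reduction idea in the abstract concrete, the following is a minimal, self-contained sketch of spatial-attention-guided token fusion between two streams. All names, shapes, and the scoring rule are illustrative assumptions, not the actual DSFN implementation: a vessel-derived attention score ranks tokens, only the top fraction is kept, and the surviving retina and vessel tokens are fused, so the downstream transformer operates on far fewer tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention_fuse(retina_tokens, vessel_tokens, keep_ratio=0.25):
    """Toy sketch (not the paper's exact method): fuse two token streams,
    keeping only the tokens with the highest vessel-derived spatial
    attention, which shrinks the token count fed to the transformer.

    retina_tokens, vessel_tokens: (n_tokens, dim) feature arrays.
    Returns (fused_tokens, kept_indices).
    """
    n, _ = retina_tokens.shape
    # Hypothetical scoring rule: one attention weight per token,
    # derived from the vessel stream's mean activation.
    scores = softmax(vessel_tokens.mean(axis=1))          # shape (n,)
    k = max(1, int(n * keep_ratio))
    keep = np.argsort(scores)[-k:]                        # top-k token indices
    # Fuse the surviving tokens, weighting the vessel cue by its score.
    fused = retina_tokens[keep] + scores[keep, None] * vessel_tokens[keep]
    return fused, keep
```

With `keep_ratio=0.25`, a 16-token input yields only 4 fused tokens, illustrating how attention-guided selection cuts the quadratic cost of subsequent self-attention.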