UT-Net: Combining U-Net and Transformer for Joint Optic Disc and Cup Segmentation and Glaucoma Detection

Glaucoma is a chronic eye disease that can cause irreversible blindness. Measuring the cup-to-disc ratio (CDR) plays a pivotal role in detecting glaucoma at an early stage, which helps prevent vision loss. Accurate, automatic segmentation of the optic disc (OD) and optic cup (OC) from retinal fundus images is therefore a fundamental requirement. Existing CNN-based segmentation frameworks rely on deep encoders with aggressive downsampling layers, which are limited in their ability to model long-range dependencies explicitly. To address this, we propose a new segmentation pipeline, UT-Net, that combines the advantages of U-Net and the transformer in its encoding layer, followed by an attention-gated bilinear fusion scheme. We also incorporate Multi-Head Contextual attention to enhance the regular self-attention used in traditional vision transformers, so that low-level features and global dependencies are both captured in a shallow manner. In addition, we extract context information at multiple encoding layers to better explore receptive fields and to help the model learn deep hierarchical representations. Finally, we propose an enhanced mixing loss to tightly supervise the overall learning process. The proposed model is applied to joint OD and OC segmentation on three publicly available datasets: DRISHTI-GS, RIM-ONE R3, and REFUGE. To further validate the proposal, we perform extensive experiments on glaucoma detection on all three datasets by measuring the CDR. Experimental results demonstrate the superiority of UT-Net over state-of-the-art methods.
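
To make the CDR measurement concrete, the following is a minimal sketch of how a CDR value could be derived from predicted OD and OC masks. It assumes the vertical CDR (ratio of the vertical cup diameter to the vertical disc diameter), which is a common clinical convention; the abstract does not state which CDR variant is used. The function names `vertical_diameter` and `vertical_cdr` are hypothetical and are not taken from the paper.

```python
import numpy as np


def vertical_diameter(mask: np.ndarray) -> int:
    """Height in pixels between the first and last row containing foreground."""
    rows = np.flatnonzero(mask.any(axis=1))
    if rows.size == 0:
        return 0
    return int(rows[-1] - rows[0] + 1)


def vertical_cdr(disc_mask, cup_mask) -> float:
    """Vertical cup-to-disc ratio from two binary masks of equal shape.

    Assumption: the vertical CDR is used; an area-based or horizontal
    ratio would need a different measurement.
    """
    disc = np.asarray(disc_mask, dtype=bool)
    cup = np.asarray(cup_mask, dtype=bool)
    if disc.shape != cup.shape:
        raise ValueError("disc and cup masks must have the same shape")
    disc_height = vertical_diameter(disc)
    if disc_height == 0:
        raise ValueError("disc mask is empty")
    return vertical_diameter(cup) / disc_height


# Toy example: a disc spanning 10 rows and a cup spanning 5 rows.
disc = np.zeros((20, 20), dtype=bool)
cup = np.zeros((20, 20), dtype=bool)
disc[5:15, 5:15] = True
cup[8:13, 8:13] = True
print(vertical_cdr(disc, cup))  # 0.5
```

In a detection setting, the resulting ratio would then be compared against a decision threshold or used as a score; the choice of threshold is dataset dependent and is not specified here.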
