Benchmarking Retinal Blood Vessel Segmentation Models for Cross-Dataset and Cross-Disease Generalization

Retinal blood vessel segmentation can extract clinically relevant information from fundus images. As manual tracing is cumbersome, algorithms based on Convolutional Neural Networks have been developed. Such studies have used small publicly available datasets for training and measuring performance, running the risk of overfitting. Here, we provide a rigorous benchmark for various architectural and training choices commonly used in the literature on the largest dataset published to date. We train and evaluate five published models on the publicly available FIVES fundus image dataset, which exceeds previous ones in size and quality and which also contains images from common ophthalmological conditions (diabetic retinopathy, age-related macular degeneration, glaucoma). We compare the performance of different model architectures across different loss functions, levels of image quality, and ophthalmological conditions, and assess their ability to perform well in the face of disease-induced domain shifts. Given sufficient training data, basic architectures such as U-Net perform just as well as more advanced ones, and transfer across disease-induced domain shifts typically works well for most architectures. However, we find that image quality is a key factor determining segmentation outcomes. When optimizing for segmentation performance, investing in a well-curated dataset to train a standard architecture yields better results than tuning a sophisticated architecture on a smaller dataset or one with lower image quality. We distill the utility of architectural advances in terms of their clinical relevance, thereby providing practical guidance for model choices depending on the circumstances of the clinical setting.
