SynFundus: A synthetic fundus images dataset with millions of samples and multi-disease annotations

In medical imaging, large-scale public datasets with high-quality annotations are rare because of data privacy constraints and annotation cost. To address this issue, we release SynFundus-1M, a high-quality synthetic dataset containing over 1 million fundus images annotated for 11 disease types. Moreover, we intentionally diversify the readability of the images and accordingly provide 4 types of quality scores for each image. To the best of our knowledge, SynFundus-1M is currently the largest fundus dataset with the most sophisticated annotations. All the images are generated by a Denoising Diffusion Probabilistic Model named SynFundus-Generator. Trained on over 1.3 million private fundus images, SynFundus-Generator achieves significantly superior performance in generating fundus images compared to some recent related works. Furthermore, when we blend synthetic images from SynFundus-1M with real fundus images, ophthalmologists can hardly distinguish the synthetic images from the real ones. Through extensive experiments, we demonstrate that both convolutional neural networks (CNNs) and Vision Transformers (ViTs) can benefit from SynFundus-1M, either through pretraining or through direct training. Compared to models trained on datasets like ImageNet or EyePACS, models trained on SynFundus-1M not only perform better but also converge faster on various downstream tasks.

