SynFundus: A synthetic fundus images dataset with millions of samples and multi-disease annotations

In medical imaging, large-scale public datasets with high-quality annotations are rare because of data privacy constraints and annotation cost. To address this issue, we release SynFundus-1M, a high-quality synthetic dataset containing over 1 million fundus images covering 11 disease types. Moreover, we intentionally diversify the readability of the images and accordingly provide 4 types of quality score for each image. To the best of our knowledge, SynFundus-1M is currently the largest fundus dataset with the most sophisticated annotations. All the images are generated by a Denoising Diffusion Probabilistic Model named SynFundus-Generator. Trained on over 1.3 million private fundus images, SynFundus-Generator achieves significantly better performance in generating fundus images than some recent related works. Furthermore, when we blend synthetic images from SynFundus-1M with real fundus images, ophthalmologists can hardly distinguish the synthetic images from the real ones. Through extensive experiments, we demonstrate that both convolutional neural networks (CNNs) and Vision Transformers (ViTs) can benefit from SynFundus-1M, either by pretraining on it or by training on it directly. Compared to models trained on datasets like ImageNet or EyePACS, models trained on SynFundus-1M not only achieve better performance but also converge faster on various downstream tasks.

