The availability of large and diverse medical datasets is often limited by privacy and data-sharing restrictions. Yet the successful application of machine learning to disease diagnosis, prognosis, and precision medicine requires large amounts of data for model building and optimization. To help overcome these limitations in the context of brain MRI, we present GenMIND: a collection of generative models of normative regional volumetric features derived from structural brain imaging. GenMIND models are trained on real regional volumetric measures from the iSTAGING consortium, which encompasses over 40,000 MRI scans across 13 studies, and they incorporate covariates such as age, sex, and race. Using GenMIND, we generate and release 18,000 synthetic samples spanning the adult lifespan (ages 22-90 years); the models can also generate an unlimited number of additional samples. Experimental results indicate that samples generated by GenMIND agree with the distributions obtained from real data. Most importantly, the generated normative data significantly improve the accuracy of downstream machine learning models on tasks such as disease classification. Data and models are available at: https://huggingface.co/spaces/rongguangw/GenMIND.