As massive medical data become available with an increasing number of scans, expanding classes, and varying sources, prevalent training paradigms -- where AI is trained with multiple passes over fixed, finite datasets -- face significant challenges. First, training AI all at once on such massive data is impractical because new scans, sources, and classes continuously arrive. Second, training AI continually on new scans, sources, and classes can lead to catastrophic forgetting, where the model loses knowledge of old data as it learns new data. To address these two challenges, we propose an online learning method that enables training AI from massive medical data. Instead of repeatedly training AI on randomly selected data samples, our method identifies the most significant samples for the current AI model based on their data uniqueness and prediction uncertainty, then trains the AI on these selected samples. Compared with prevalent training paradigms, our method not only improves data efficiency by enabling training on continual data streams, but also mitigates catastrophic forgetting by selectively training AI on significant samples that might otherwise be forgotten, outperforming prevalent paradigms by 15% in Dice score for multi-organ and tumor segmentation. The code is available at https://github.com/MrGiovanni/OnlineLearning
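The selection step described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: it assumes prediction uncertainty is measured by the entropy of the model's class probabilities and data uniqueness by the nearest-neighbour feature distance to previously seen samples, and it combines the two scores with an equal-weight sum; the actual method may use different scoring functions and weights.

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Prediction uncertainty: Shannon entropy of class probabilities."""
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def uniqueness(features, memory):
    """Data uniqueness: distance from each sample's feature vector to its
    nearest neighbour among previously seen features (assumed proxy)."""
    if len(memory) == 0:
        # Nothing seen yet: treat all samples as equally unique.
        return np.ones(len(features))
    dists = np.linalg.norm(features[:, None, :] - memory[None, :, :], axis=-1)
    return dists.min(axis=1)

def select_significant(probs, features, memory, k):
    """Rank a streamed batch by uncertainty + uniqueness and keep the top-k
    indices, on which the model would then be trained."""
    def norm(x):
        # Min-max normalize so neither score dominates the sum.
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)
    score = norm(entropy(probs)) + norm(uniqueness(features, memory))
    return np.argsort(score)[::-1][:k]

# Toy stream: sample 0 is most uncertain, sample 1 is most unique.
probs = np.array([[0.5, 0.5], [0.99, 0.01], [0.9, 0.1]])
features = np.array([[0.0, 0.0], [5.0, 5.0], [0.1, 0.1]])
memory = np.array([[0.0, 0.0]])  # features of previously seen data
print(select_significant(probs, features, memory, k=2))
```

In an online loop, the features of the selected samples would be appended to `memory` after each training step, so uniqueness is always measured against everything the model has already seen.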