Data sharing in medical image analysis holds great potential yet remains underappreciated. The goal is often to share datasets efficiently with other sites so that they can train models effectively. One possible solution is to avoid transferring the entire dataset while still achieving similar model performance. Recent progress in data distillation in computer science offers a promising way to share medical data efficiently without significantly compromising model effectiveness. However, it remains uncertain whether these methods apply to medical imaging, since medical images differ substantially from natural images. It is also unclear what level of performance these methods can achieve. To answer these questions, we investigate a variety of leading data distillation methods in different medical imaging contexts. We evaluate their feasibility through extensive experiments along two lines: 1) assessing the impact of data distillation across multiple datasets with minor or major variations, and 2) exploring an indicator that predicts distillation performance. Our experiments across multiple medical datasets show that data distillation can significantly reduce dataset size while maintaining model performance comparable to that achieved with the full dataset, and suggest that a small, representative sample of images can serve as a reliable indicator of distillation success. This study demonstrates that data distillation is a viable method for efficient and secure medical data sharing, with the potential to facilitate collaborative research and clinical applications.