Progressive trajectory matching for medical dataset distillation

Sharing medical image datasets is essential but challenging because of privacy issues, which hinders building foundation models and knowledge transfer. In this paper, we propose a novel dataset distillation method that condenses the original medical image datasets into a synthetic one that preserves the information needed to build an analysis model without accessing the original datasets. Existing methods address only natural images; they randomly match segments of the parameter training trajectories of models trained on the whole real datasets. However, our extensive experiments show that on medical image datasets this training process is extremely unstable and yields inferior distillation results. To overcome these barriers, we design a novel progressive trajectory matching strategy that improves training stability for medical image dataset distillation. Additionally, we observe that the improved stability limits the diversity of the synthetic dataset and the final performance gains. Therefore, we propose a dynamic overlap mitigation module that improves the diversity of the synthetic dataset by dynamically eliminating the overlap across different images and retraining parts of the synthetic images for better convergence. Finally, we propose a new medical image dataset distillation benchmark covering various modalities and configurations to promote fair evaluation. Our proposed method achieves an 8.33% improvement over previous state-of-the-art methods on average, and an 11.7% improvement when ipc=2 (i.e., two images per class). Code and benchmarks will be released.
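To make the contrast between random and progressive matching concrete, here is a minimal sketch in plain Python. The loss is the normalized parameter distance commonly used in trajectory-matching distillation. The linear schedule in `sample_start_epoch` is only an assumption for illustration, since the abstract does not specify how the progression is defined. All function and parameter names are hypothetical.

```python
import random


def matching_loss(student_end, expert_start, expert_target):
    """Normalized squared distance between student and expert parameters.

    student_end:   parameters after training on the synthetic data
    expert_start:  expert parameters where the matched segment begins
    expert_target: expert parameters where the matched segment ends
    Assumes expert_start != expert_target (non-zero denominator).
    """
    num = sum((s - t) ** 2 for s, t in zip(student_end, expert_target))
    den = sum((a - t) ** 2 for a, t in zip(expert_start, expert_target))
    return num / den


def sample_start_epoch(iteration, total_iterations, max_start_epoch, rng,
                       progressive=True):
    """Choose the expert epoch at which the matched segment starts.

    progressive=False: uniform over [0, max_start_epoch] (random matching).
    progressive=True:  ASSUMED schedule, not taken from the paper; the upper
                       bound of the sampling range grows linearly with the
                       distillation iteration, so early iterations only
                       match early parts of the expert trajectory.
    """
    if not progressive:
        return rng.randint(0, max_start_epoch)
    fraction = (iteration + 1) / total_iterations
    upper = min(max_start_epoch, int(fraction * max_start_epoch))
    return rng.randint(0, upper)


if __name__ == "__main__":
    rng = random.Random(0)
    for it in (0, 250, 500, 999):
        print(it, sample_start_epoch(it, 1000, 20, rng))
```

Under this assumed schedule the range of admissible start epochs widens gradually, which is one plausible way to reduce the variance that fully random segment selection introduces early in distillation.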

