Dataset condensation (DC) learns a compact synthetic dataset that enables models to match the performance of full-data training, prioritising utility over distributional fidelity. While typically explored for computational efficiency, DC also holds promise for healthcare data democratisation, especially when paired with differential privacy, allowing synthetic data to serve as a safe alternative to real records. However, existing DC methods rely on differentiable neural networks, limiting their compatibility with widely used clinical models such as decision trees and Cox regression. We address this gap with a differentially private zeroth-order optimisation framework that extends DC to non-differentiable models, requiring only function evaluations. Empirical results across six datasets, spanning both classification and survival tasks, show that the proposed method produces condensed datasets that preserve model utility while providing effective differential privacy guarantees, enabling model-agnostic data sharing for clinical prediction tasks without exposing sensitive patient information.
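To make the key mechanism concrete: zeroth-order optimisation estimates gradients from function evaluations alone, so the objective (here, the utility of a model trained on the synthetic data) never needs to be differentiable. Below is a minimal, self-contained sketch of a standard two-point Gaussian-smoothing gradient estimator applied to a toy objective; it is an illustration of the general technique, not the paper's actual algorithm, and the function names and hyperparameters (`mu`, `n_samples`) are chosen for the example.

```python
import numpy as np

def zeroth_order_grad(f, x, mu=1e-3, n_samples=20, rng=None):
    """Estimate the gradient of f at x using only function evaluations
    (two-point Gaussian-smoothing estimator; no autodiff required)."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = x.shape[0]
    g = np.zeros(d)
    for _ in range(n_samples):
        u = rng.standard_normal(d)  # random perturbation direction
        # finite-difference slope along u, projected back onto u
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / n_samples

# Toy stand-in for a non-differentiable "utility loss" of synthetic data:
# here a simple quadratic with minimiser at 3, queried as a black box.
f = lambda x: np.sum((x - 3.0) ** 2)

x = np.zeros(4)  # "synthetic data" being optimised
for _ in range(200):
    x -= 0.05 * zeroth_order_grad(f, x)
# x converges toward the minimiser [3, 3, 3, 3]
```

In a DC setting, `f` would evaluate a trained downstream model (e.g. a decision tree) on a validation criterion, and the differential privacy budget would constrain how many (noised) evaluations of `f` may be issued.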