Dataset condensation (DC) learns a compact synthetic dataset that enables models to match the performance of full-data training, prioritising utility over distributional fidelity. While typically explored for computational efficiency, DC also holds promise for healthcare data democratisation, especially when paired with differential privacy, allowing synthetic data to serve as a safe alternative to real records. However, existing DC methods rely on differentiable neural networks, limiting their compatibility with widely used clinical models such as decision trees and Cox regression. We address this gap with a differentially private zeroth-order optimisation framework that extends DC to non-differentiable models, requiring only function evaluations. Empirical results across six datasets, spanning both classification and survival tasks, show that the proposed method produces condensed datasets that preserve model utility while providing effective differential privacy guarantees, enabling model-agnostic data sharing for clinical prediction tasks without exposing sensitive patient information.
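To make the key mechanism concrete: zeroth-order optimisation estimates gradients from function evaluations alone, so the objective (here, the utility of a model trained on the synthetic data) never needs to be differentiable. Below is a minimal, self-contained sketch of a standard two-point Gaussian-smoothing gradient estimator applied to a toy objective; it is an illustration of the general technique, not the paper's actual algorithm, and the function names and hyperparameters (`mu`, `n_samples`) are chosen for the example.

```python
import numpy as np

def zeroth_order_grad(f, x, mu=1e-3, n_samples=20, rng=None):
    """Estimate the gradient of f at x using only function evaluations
    (two-point Gaussian-smoothing estimator; no autodiff required)."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = x.shape[0]
    g = np.zeros(d)
    for _ in range(n_samples):
        u = rng.standard_normal(d)  # random perturbation direction
        # finite-difference slope along u, projected back onto u
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / n_samples

# Toy stand-in for a non-differentiable "utility loss" of synthetic data:
# here a simple quadratic with minimiser at 3, queried as a black box.
f = lambda x: np.sum((x - 3.0) ** 2)

x = np.zeros(4)  # "synthetic data" being optimised
for _ in range(200):
    x -= 0.05 * zeroth_order_grad(f, x)
# x converges toward the minimiser [3, 3, 3, 3]
```

In a DC setting, `f` would evaluate a trained downstream model (e.g. a decision tree) on a validation criterion, and the differential privacy budget would constrain how many (noised) evaluations of `f` may be issued.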