With the proliferation of ever more complicated Deep Learning architectures, data synthesis is a highly promising technique for meeting the demands of data-hungry models. However, reliably assessing the quality of a 'synthesiser' model's output remains an open research question, with significant associated risks in high-stakes domains. To address this challenge, we have designed a confident data synthesis algorithm that introduces statistical confidence guarantees through a novel extension of the Conformal Prediction framework. We support the proposed algorithm with theoretical proofs and an extensive empirical evaluation on five benchmark datasets. To demonstrate the approach's versatility on ubiquitous real-world challenges, the datasets were carefully selected for their variety of difficult characteristics: low sample count, class imbalance and non-separability, and privacy-sensitive data. In all trials, training sets extended with our confident synthesised data performed at least as well as the original training sets, and frequently yielded significant improvements in Deep Learning performance, of up to +65% F1-score.