Synthetic data generation has proven effective at improving model performance and robustness when data are scarce or of low quality. Using the data valuation framework to statistically identify beneficial and detrimental observations, we introduce a novel augmentation pipeline that generates only high-value training points based on hardness characterization. We first demonstrate via benchmarks on real data that Shapley-based data valuation methods perform comparably with learning-based methods in hardness characterization tasks, while offering significant theoretical and computational advantages. Then, we show that synthetic data generators trained on the hardest points outperform non-targeted data augmentation on simulated data and on a large-scale credit default prediction task. In particular, our approach improves the quality of out-of-sample predictions and is computationally more efficient than non-targeted methods.
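The pipeline summarized in the abstract (value each training point, keep the hardest ones, and fit a generator on them only) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the valuation uses the closed-form KNN-Shapley recursion of Jia et al. (2019) as one possible Shapley-based method, "hardest" is taken to mean lowest Shapley value, and `jitter_generator` is a hypothetical stand-in for whatever synthetic data generator is actually trained. The function names, `k`, `fraction`, and `scale` are likewise hypothetical.

```python
import numpy as np


def knn_shapley(X_train, y_train, X_val, y_val, k=5):
    """Exact Shapley values for an unweighted k-NN classifier,
    averaged over validation points (closed-form recursion)."""
    n = len(X_train)
    values = np.zeros(n)
    for x, y in zip(X_val, y_val):
        # Sort training points by distance to the validation point, nearest first.
        order = np.argsort(np.linalg.norm(X_train - x, axis=1))
        match = (y_train[order] == y).astype(float)
        s = np.zeros(n)
        # The farthest point starts the recursion.
        s[n - 1] = match[n - 1] / n
        for i in range(n - 2, -1, -1):
            rank = i + 1  # 1-based rank of the i-th nearest point
            s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, rank) / rank
        values[order] += s
    return values / len(X_val)


def hardest_indices(values, fraction=0.1):
    """Assumption: the 'hardest' points are those with the lowest values."""
    m = max(1, int(round(fraction * len(values))))
    return np.argsort(values)[:m]


def jitter_generator(X_hard, y_hard, n_new, scale=0.05, seed=0):
    """Hypothetical stand-in for a trained generator: resample hard points
    and add small Gaussian noise scaled by each feature's spread."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X_hard), size=n_new)
    noise = rng.normal(0.0, scale * X_hard.std(axis=0), size=(n_new, X_hard.shape[1]))
    return X_hard[idx] + noise, y_hard[idx]
```

In use, one would compute `values = knn_shapley(X_train, y_train, X_val, y_val)`, pick `hard = hardest_indices(values)`, and append the output of the generator fitted on `X_train[hard]` to the training set. A non-targeted baseline would fit the same generator on all of `X_train` instead.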