We propose and study a minimalist approach to synthetic tabular data generation. The model consists of a minimal unsupervised SparsePCA encoder (with a clustering step or log transformation added when needed to handle nonlinearity) and an XGBoost decoder, which is state of the art for regression and classification on structured data. We compare the approach with (variational) autoencoders in several low-dimensional toy scenarios to build the necessary intuition. We then apply the framework to high-dimensional simulated credit scoring data that parallels real-life financial applications, and use it for robustness testing to demonstrate a practical use case. The case study results suggest that the method provides an alternative to raw and quantile perturbation for model robustness testing. We show that the method is simple, guarantees interpretability end to end, requires no extra tuning, and provides unique benefits.