This work proposes a method to evaluate the similarity between low-sample tabular data and synthetically generated data containing more samples than the original. Generating such data is known as data augmentation. When the sample size is limited, however, significance values derived from non-parametric tests are questionable. Our approach combines geometry, topology, and robust statistics for hypothesis testing to evaluate the "validity" of the generated data. We additionally contrast our findings with prominent global metrics described in the literature for large-sample data.