Data augmentation is essential when applying machine learning in small-data regimes. It generates new samples that follow the observed data distribution while increasing their diversity and variability, helping researchers and practitioners improve their models' robustness and thus deploy them in the real world. Nevertheless, its use on tabular data still needs improvement: prior knowledge about the underlying data mechanism is seldom considered, which limits the fidelity and diversity of the generated data. Causal data augmentation strategies have been proposed as a way to address these challenges by relying on the conditional independences encoded in a causal graph. In this context, this paper experimentally analyzes the ADMG causal augmentation method under different settings, to help researchers and practitioners understand under which conditions prior knowledge helps generate new data points and, consequently, enhances the robustness of their models. The results highlight that the studied method (a) is independent of the underlying model mechanism, (b) requires a minimum number of observations to improve an ML model's accuracy, which may be difficult to obtain in a small-data regime, (c) propagates outliers to the augmented set, degrading the model's performance, and (d) is sensitive to the value of its hyperparameter.