Feature preprocessing continues to play a critical role when applying machine learning and statistical methods to tabular data. In this paper, we propose the use of the kernel density integral transformation as a feature preprocessing step. Our approach subsumes the two leading feature preprocessing methods, linear min-max scaling and quantile transformation, as limiting cases. We demonstrate that, without hyperparameter tuning, the kernel density integral transformation can be used as a simple drop-in replacement for either method, offering protection from the weaknesses of each. Alternatively, with tuning of a single continuous hyperparameter, we frequently outperform both of these methods. Finally, we show that the kernel density integral transformation can be profitably applied to statistical data analysis, particularly in correlation analysis and univariate clustering.
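To make the limiting-case claim concrete, the following is a minimal sketch of one way such a transformation could be computed for a single feature. It is not taken from the paper: the function name `kd_integral_transform`, the Gaussian kernel, the bandwidth parameterization (a factor `alpha` times the sample standard deviation), and the rescaling that maps the training minimum to 0 and the maximum to 1 are all assumptions made for illustration.

```python
import math


def kd_integral_transform(train, values, alpha=1.0):
    """Map `values` into [0, 1] using the integral (CDF) of a Gaussian
    kernel density estimate fit on `train`.

    Hypothetical helper for illustration; `alpha` scales the bandwidth
    relative to the sample standard deviation (an assumed convention).
    """
    n = len(train)
    mean = sum(train) / n
    std = math.sqrt(sum((t - mean) ** 2 for t in train) / n)
    if std == 0.0:
        raise ValueError("feature is constant; bandwidth would be zero")
    h = alpha * std  # kernel bandwidth

    def kde_cdf(v):
        # The integral of a Gaussian KDE is the average of Gaussian CDFs
        # centred on the training points.
        scale = h * math.sqrt(2.0)
        return sum(0.5 * (1.0 + math.erf((v - t) / scale)) for t in train) / n

    # Rescale so the training minimum maps to 0 and the maximum to 1.
    lo, hi = kde_cdf(min(train)), kde_cdf(max(train))
    return [min(1.0, max(0.0, (kde_cdf(v) - lo) / (hi - lo))) for v in values]


data = [0.0, 1.0, 2.0, 3.0, 100.0]  # one outlier dominates the range

# Very large bandwidth: the KDE integral is nearly linear over the data range,
# so the output is close to min-max scaling.
wide = kd_integral_transform(data, data, alpha=1000.0)

# Very small bandwidth: the KDE integral approaches the empirical CDF,
# so the output is close to a rank-based (quantile) transformation.
narrow = kd_integral_transform(data, data, alpha=1e-4)
```

Under these assumptions, `wide` compresses the four small values near 0 because of the outlier, as min-max scaling does, while `narrow` spaces the five points evenly, as a quantile transformation does. Intermediate values of `alpha` interpolate between the two behaviors, which corresponds to the single continuous hyperparameter mentioned in the abstract.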