Feature preprocessing continues to play a critical role when applying machine learning and statistical methods to tabular data. In this paper, we propose the use of the kernel density integral transformation as a feature preprocessing step. Our approach subsumes the two leading feature preprocessing methods, linear min-max scaling and quantile transformation, as limiting cases. We demonstrate that, without hyperparameter tuning, the kernel density integral transformation can be used as a simple drop-in replacement for either method, offering robustness to the weaknesses of each. Alternatively, with tuning of a single continuous hyperparameter, we frequently outperform both of these methods. Finally, we show that the kernel density integral transformation can be profitably applied to statistical data analysis, particularly in correlation analysis and univariate clustering.
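To make the limiting-case claim concrete, the following is a minimal sketch of one way such a transformation could be computed for a single feature. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the function name `kd_integral_transform`, the Gaussian kernel, and the bandwidth rule `h = alpha * std(x)` are all choices made here for the example. The feature is mapped through the integral of its kernel density estimate and then rescaled so that the sample minimum maps to 0 and the sample maximum to 1.

```python
import math
import numpy as np

# Elementwise error function built from the standard library (avoids a SciPy dependency).
_erf = np.vectorize(math.erf)


def kd_integral_transform(x, alpha=1.0):
    """Hypothetical sketch: map a 1-D feature to [0, 1] via the integral of its KDE.

    Assumptions (not taken from the paper): Gaussian kernel, bandwidth
    h = alpha * std(x), and rescaling so min(x) -> 0 and max(x) -> 1.
    """
    x = np.asarray(x, dtype=float)
    h = alpha * x.std()
    if h == 0.0:
        # Constant feature: nothing to scale.
        return np.zeros_like(x)

    def kde_cdf(t):
        # Integral of the Gaussian KDE up to t: the mean of per-sample normal CDFs.
        t = np.atleast_1d(np.asarray(t, dtype=float))
        z = (t[:, None] - x[None, :]) / (h * math.sqrt(2.0))
        return (0.5 * (1.0 + _erf(z))).mean(axis=1)

    lo = kde_cdf(x.min())[0]
    hi = kde_cdf(x.max())[0]
    return (kde_cdf(x) - lo) / (hi - lo)


if __name__ == "__main__":
    x = np.array([0.0, 1.0, 2.0, 10.0])  # one large outlier
    print(kd_integral_transform(x, alpha=1e4))   # very large bandwidth: close to min-max scaling
    print(kd_integral_transform(x, alpha=1e-4))  # very small bandwidth: close to a rank-based quantile transform
    print(kd_integral_transform(x, alpha=1.0))   # intermediate bandwidth
```

In this sketch, `alpha` plays the role of the single continuous hyperparameter: a very large bandwidth makes the integrated density nearly linear over the data range, which recovers min-max scaling, while a very small bandwidth makes it approach the empirical distribution function, which recovers a uniform-output quantile transformation.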