A Comparison of Modeling Preprocessing Techniques

This paper compares data preprocessing methods by their effect on predictive performance for structured data. It also seeks to identify and recommend preprocessing methodologies for tree-based binary classification models, with a focus on eXtreme Gradient Boosting (XGBoost) models. Three synthetic data sets of varying structure, interactions, and complexity were constructed and supplemented by a real-world data set from the Lending Club. We compare several methods for feature selection, categorical handling, and null imputation. Performance is assessed through relative comparisons among the chosen methodologies, including the variability of model predictions. The paper is organized by the three groups of preprocessing methodologies, with each section consisting of generalized observations. Each observation is accompanied by a recommendation of one or more preferred methodologies. Among feature selection methods, permutation-based feature importance, regularization, and XGBoost's feature importance by weight are not recommended. Correlation coefficient reduction also shows inferior performance. Instead, XGBoost importance by gain shows the most consistent and strongest performance. Categorical feature encoding methods discriminate more strongly in performance across data set structures. While there was no universal "best" method, frequency encoding showed the strongest performance on the most complex data set (Lending Club) but the poorest performance on all synthetic (i.e., simpler) data sets. Finally, missing indicator imputation dominated in performance among imputation methods, whereas tree imputation showed extremely poor and highly variable model performance.
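
For readers unfamiliar with two of the methods named above, the following is a minimal illustrative sketch of frequency encoding and missing indicator imputation in plain Python. It is not the implementation used in this paper. The function names, the toy columns `grades` and `income`, the use of relative frequencies rather than raw counts, and the fill value of 0.0 are all assumptions made for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def frequency_encode(values: Sequence[str]) -> list[float]:
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


def missing_indicator_impute(
    values: Sequence[Optional[float]], fill: float = 0.0
) -> tuple[list[float], list[int]]:
    """Fill missing entries with a constant and return a 0/1 flag column
    marking which entries were originally missing."""
    filled = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return filled, indicator


# Hypothetical toy columns (not taken from the paper's data sets).
grades = ["A", "B", "A", "C", "A", "B"]
income = [52.0, None, 61.5, None, 48.0, 75.0]

grade_freq = frequency_encode(grades)
income_filled, income_missing = missing_indicator_impute(income)

print(grade_freq)
print(income_filled)
print(income_missing)
```

In this sketch the indicator column is passed to the model alongside the filled column, so a tree-based learner can split on missingness itself rather than on the arbitrary fill value.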
