A Comparison of Modeling Preprocessing Techniques

This paper compares data preprocessing methods by their effect on predictive performance for structured data, and recommends preprocessing methodologies for tree-based binary classification models, with a focus on eXtreme Gradient Boosting (XGBoost) models. Three synthetic data sets of varying structure, interactions, and complexity were constructed and supplemented by a real-world data set from the Lending Club. We compare several methods for feature selection, categorical handling, and null imputation. Performance is assessed through relative comparisons among the chosen methodologies, including the variability of model predictions. The paper is organized around these three groups of preprocessing methodologies; each section presents generalized observations, and each observation is accompanied by a recommendation of one or more preferred methodologies. Among feature selection methods, permutation-based feature importance, regularization, and XGBoost's feature importance by weight are not recommended, and correlation coefficient reduction also shows inferior performance. Instead, XGBoost importance by gain shows the most consistent and strongest performance. Categorical feature encoding methods differ more in performance across data set structures. While there was no universal "best" method, frequency encoding showed the greatest performance for the most complex data set (Lending Club) but the poorest performance for all synthetic (i.e., simpler) data sets. Finally, missing indicator imputation dominated in terms of performance among imputation methods, whereas tree imputation showed extremely poor and highly variable model performance.
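
For readers unfamiliar with two of the methods named above, frequency encoding and missing indicator imputation, the following is a minimal sketch of what each typically does. It is not the paper's implementation: the function names, the choice to fit frequencies on a training sample, the use of `None` to mark missing values, and the constant fill value of 0.0 are all assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def frequency_encode(train: Sequence[Hashable],
                     values: Sequence[Hashable]) -> List[float]:
    """Replace each category with its relative frequency in `train`."""
    if len(train) == 0:
        raise ValueError("train must contain at least one value")
    counts = Counter(train)
    total = len(train)
    # Counter returns 0 for unseen keys, so unseen categories encode as 0.0.
    return [counts[v] / total for v in values]


def missing_indicator_impute(values: Sequence[Optional[float]],
                             fill_value: float = 0.0
                             ) -> Tuple[List[float], List[int]]:
    """Fill missing entries and return a parallel 0/1 missingness flag."""
    # Assumption: missing values are represented by None.
    indicator = [1 if v is None else 0 for v in values]
    filled = [fill_value if v is None else v for v in values]
    return filled, indicator


# Hypothetical toy data, not taken from the paper's data sets.
grades = ["A", "B", "A", "C", "A", "B"]
incomes = [52.0, None, 61.0, 48.0, None, 75.0]

grade_freq = frequency_encode(grades, grades)
income_filled, income_missing = missing_indicator_impute(incomes)
```

In this sketch the indicator column lets a tree-based model split on missingness itself, while the filled column keeps the feature numeric; the fill value is arbitrary here and would need to be chosen for the data at hand.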
