Categorical encoders transform categorical features into numerical representations, which are indispensable for a wide range of machine learning models. Existing encoder benchmark studies lack generalizability because they consider a limited choice of (1) encoders, (2) experimental factors, and (3) datasets. Additionally, inconsistencies arise because different studies adopt different aggregation strategies. This paper presents the most comprehensive benchmark of categorical encoders to date: an extensive evaluation of 32 configurations of encoders from diverse families, under 36 combinations of experimental factors, on 50 datasets. The study shows that dataset selection, experimental factors, and aggregation strategies profoundly influence a benchmark's conclusions, aspects disregarded in previous encoder benchmarks.
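To make the point about aggregation strategies concrete, the following minimal sketch shows how two common ways of aggregating per-dataset scores can crown different winners. The encoder names, the scores, and the helper functions `best_by_mean` and `best_by_mean_rank` are all hypothetical and invented for illustration; they are not results or code from the paper, and the paper's own aggregation procedures may differ.

```python
# Hypothetical scores (higher is better) for three encoders on four datasets.
# These numbers are invented for illustration; they are NOT results from the paper.
scores = {
    "one_hot": [0.90, 0.60, 0.61, 0.62],
    "target":  [0.70, 0.65, 0.66, 0.67],
    "ordinal": [0.50, 0.64, 0.65, 0.66],
}


def best_by_mean(scores):
    """Return the encoder with the highest mean score across datasets."""
    return max(scores, key=lambda enc: sum(scores[enc]) / len(scores[enc]))


def best_by_mean_rank(scores):
    """Return the encoder with the lowest summed rank (1 = best) across datasets."""
    encoders = list(scores)
    n_datasets = len(next(iter(scores.values())))
    rank_sum = {enc: 0 for enc in encoders}
    for d in range(n_datasets):
        # Rank encoders on dataset d, best score first.
        ordered = sorted(encoders, key=lambda enc: scores[enc][d], reverse=True)
        for rank, enc in enumerate(ordered, start=1):
            rank_sum[enc] += rank
    return min(rank_sum, key=rank_sum.get)


print("Best by mean score:", best_by_mean(scores))
print("Best by mean rank: ", best_by_mean_rank(scores))
```

In this toy setup, one encoder wins on the mean score because of a single large margin on one dataset, while another wins on mean rank because it is consistently ahead on the remaining datasets. This is the kind of disagreement that makes the choice of aggregation strategy matter.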