Adequately generating and evaluating prediction models based on supervised machine learning (ML) is often challenging, especially for less experienced users in applied research areas. Special attention is required in settings where the model generation process involves hyperparameter tuning, i.e., the data-driven optimization of different types of hyperparameters to improve the predictive performance of the resulting model. Discussions about tuning typically focus on the hyperparameters of the ML algorithm (e.g., the minimum number of observations in each terminal node for a tree-based algorithm). In this context, it is often neglected that hyperparameters also exist for the preprocessing steps applied to the data before it is passed to the algorithm (e.g., how to handle missing feature values). As a consequence, users who experiment with different preprocessing options to improve model performance may be unaware that this constitutes a form of hyperparameter tuning, albeit an informal and unsystematic one, and may thus fail to report or account for this optimization. To illuminate this issue, this paper reviews and empirically illustrates different procedures for generating and evaluating prediction models, explicitly addressing the ways in which algorithm and preprocessing hyperparameters are typically handled by applied ML users. By highlighting potential pitfalls, especially those that may lead to exaggerated performance claims, this review aims to further improve the quality of predictive modeling in ML applications.
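As a concrete illustration of the point made above, the following minimal sketch shows how a preprocessing hyperparameter (the strategy for imputing missing feature values) and an algorithm hyperparameter (the minimum number of observations per terminal node of a tree) can be optimized jointly and systematically, with the tuning itself wrapped in an outer resampling loop so that the reported performance is not optimistically biased. The sketch is not taken from the paper; it assumes scikit-learn, and the dataset, injected missingness, and grid values are arbitrary illustrative choices.

```python
# Minimal sketch (assumptions: scikit-learn available; diabetes toy data;
# artificially injected missing values; illustrative grid values).
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X = X.copy()
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan  # inject missingness for illustration

# Preprocessing and algorithm form one pipeline, so both kinds of
# hyperparameters live in a single, explicit search space.
pipe = Pipeline([
    ("impute", SimpleImputer()),
    ("tree", DecisionTreeRegressor(random_state=0)),
])
param_grid = {
    "impute__strategy": ["mean", "median"],   # preprocessing hyperparameter
    "tree__min_samples_leaf": [1, 5, 20],     # algorithm hyperparameter
}

# Nested cross-validation: the inner loop selects the configuration,
# the outer loop evaluates the entire tuning procedure, so informal
# preprocessing experimentation cannot leak into the reported score.
inner = GridSearchCV(pipe, param_grid, cv=5)
outer_scores = cross_val_score(inner, X, y, cv=5)
print(f"nested-CV R^2: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

The key design choice is that the imputation strategy sits inside the tuned pipeline rather than being fixed by trial and error beforehand: whatever is varied to improve performance is part of the hyperparameter search and must be accounted for in the evaluation.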