Tree-boosting is a widely used machine learning technique for tabular data. However, its out-of-sample accuracy depends critically on multiple hyperparameters. In this article, we empirically compare several popular methods for hyperparameter optimization for tree-boosting, including random grid search, the tree-structured Parzen estimator (TPE), Gaussian-process-based Bayesian optimization (GP-BO), Hyperband, the sequential model-based algorithm configuration (SMAC) method, and deterministic full grid search, using $59$ regression and classification data sets. We find that the SMAC method clearly outperforms all the other considered methods. We further observe that (i) accurate tuning requires a relatively large number of trials, more than $100$, (ii) using default values for hyperparameters yields very inaccurate models, (iii) all considered hyperparameters can have a material effect on the accuracy of tree-boosting, i.e., there is no small set of hyperparameters that is more important than the others, and (iv) for regression tasks, choosing the number of boosting iterations via early stopping yields more accurate results than including it in the search space.
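To make finding (iv) concrete, the following is a minimal sketch of the early-stopping rule the abstract refers to: the number of boosting iterations is chosen by monitoring a validation loss and stopping once it has not improved for a fixed number of rounds. The function name and the `patience` parameter are illustrative choices, not taken from any particular library.

```python
def select_rounds_by_early_stopping(val_losses, patience=10):
    """Choose the number of boosting iterations by early stopping.

    Scans the per-round validation losses in order, tracks the best
    loss seen so far, and stops once `patience` consecutive rounds
    pass without improvement. Returns the best round (1-based) and
    its validation loss.
    """
    best_loss = float("inf")
    best_round = 0
    for i, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: record this round as the current best.
            best_loss = loss
            best_round = i
        elif i - best_round >= patience:
            # No improvement for `patience` rounds: stop scanning.
            break
    return best_round, best_loss


# Example: validation loss improves for three rounds, then stagnates.
losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.73, 0.74]
rounds, loss = select_rounds_by_early_stopping(losses, patience=3)
```

Compared with treating the iteration count as one more dimension of the search space, this rule determines it cheaply inside each single training run, which is what makes the early-stopping variant attractive in practice.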