Two-step hyperparameter optimization method: Accelerating hyperparameter search by using a fraction of a training dataset

Hyperparameter optimization (HPO) is an important step in machine learning (ML) model development, but common practice remains dated, relying primarily on manual or grid search. This is partly because adopting advanced HPO algorithms adds complexity to the workflow and lengthens computation time. The consequence is a notable obstacle for ML applications: suboptimal hyperparameter choices limit model performance and keep ML techniques from being fully exploited. In this article, we present a two-step HPO method that reduces computational demand and wait time, drawn from practical experience in applied ML parameterization work. The first step evaluates hyperparameters on a small subset of the training dataset; the second step re-evaluates the top-performing candidate models after retraining them on the entire training dataset. The two-step method can be combined with any HPO search algorithm, and we argue that it offers attractive efficiency gains. As a case study, we present our recent application of the two-step HPO method to the development of neural network emulators for aerosol activation. Although our primary use case is a data-rich limit with many millions of samples, we find that using as little as 0.0025% of the data (a few thousand samples) in the initial step is sufficient to find optimal hyperparameter configurations from much more extensive sampling, achieving up to a 135-fold speedup. A further benefit comes from assessing hyperparameters against model performance, which reveals the minimal model complexity required to achieve the best performance. The set of top-performing models produced by the HPO process allows us to choose a high-performing model with a low inference cost for efficient use in global climate models (GCMs).
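
The two-step procedure described above can be illustrated with a minimal sketch. This is not the implementation used in this work: the function names (`two_step_hpo`, `ridge_fit_and_score`), the toy one-parameter ridge model, the synthetic data, and the values of `fraction` and `top_k` are all assumptions chosen for illustration, and plain random search stands in for whatever HPO search algorithm is actually used.

```python
import random


def two_step_hpo(configs, fit_and_score, train, val, fraction=0.01, top_k=5, seed=0):
    """Illustrative two-step HPO (hypothetical helper, not the authors' code).

    Step 1: score every candidate config after training on a small random
            subset of `train`.
    Step 2: retrain only the top_k candidates on all of `train` and re-rank.
    Both returned lists hold (validation_loss, config) pairs, best first.
    """
    rng = random.Random(seed)
    n_sub = max(1, int(len(train) * fraction))
    subset = rng.sample(train, n_sub)

    # Step 1: cheap screening of all candidates on the subset.
    step1 = sorted(((fit_and_score(c, subset, val), c) for c in configs),
                   key=lambda pair: pair[0])
    finalists = [c for _, c in step1[:top_k]]

    # Step 2: expensive full-data retraining, finalists only.
    step2 = sorted(((fit_and_score(c, train, val), c) for c in finalists),
                   key=lambda pair: pair[0])
    return step1, step2


def ridge_fit_and_score(config, train, val):
    """Toy model (assumption): 1-D ridge regression through the origin.

    The single hyperparameter is the penalty `lam`; the return value is the
    mean squared error on the validation set.
    """
    lam = config["lam"]
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    w = sxy / (sxx + lam)
    return sum((y - w * x) ** 2 for x, y in val) / len(val)


def make_data(n, rng):
    """Synthetic data (assumption): y = 2x + Gaussian noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 2.0 * x + rng.gauss(0.0, 0.1)))
    return data


data_rng = random.Random(42)
train = make_data(20000, data_rng)
val = make_data(2000, data_rng)

# Random search over the penalty, log-uniform in [1e-3, 1e3].
configs = [{"lam": 10.0 ** data_rng.uniform(-3.0, 3.0)} for _ in range(30)]

step1, step2 = two_step_hpo(configs, ridge_fit_and_score, train, val,
                            fraction=0.01, top_k=5)
best_loss, best_config = step2[0]
print("best config after step 2:", best_config, "validation MSE:", best_loss)
```

In this sketch all 30 candidates are trained on 1% of the data, and only 5 are retrained on the full set, so the number of full-data training runs drops from 30 to 5. The actual saving in any real application depends on how training cost scales with dataset size and on the choice of `fraction` and `top_k`.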
