To reach high performance with deep learning, hyperparameter optimization (HPO) is essential. This process is usually time-consuming because evaluating neural networks is costly. Early discarding techniques limit the resources granted to unpromising candidates by observing their empirical learning curves and canceling training as soon as a candidate's lack of competitiveness becomes evident. Despite two decades of research, little is understood about the trade-off between the aggressiveness of discarding and the loss of predictive performance. Our paper studies this trade-off for several commonly used discarding techniques, such as successive halving and learning curve extrapolation. Our surprising finding is that these techniques offer minimal to no added value compared to the simple strategy of discarding after a constant number of training epochs. The chosen number of epochs depends mostly on the available compute budget. We call this approach i-Epoch (i being the constant number of epochs for which each neural network is trained) and suggest assessing the quality of an early discarding technique by comparing how its Pareto front (in consumed training epochs and predictive performance) complements the Pareto front of i-Epoch.
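The i-Epoch baseline and the Pareto-front comparison described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration and not the paper's implementation: the function names `i_epoch` and `pareto_front`, the representation of learning curves as lists of validation scores (higher is better), the choice to measure predictive performance as the selected candidate's final-epoch score, and the toy curve values, which are made up and are not results from the paper.

```python
from typing import List, Sequence, Tuple


def i_epoch(curves: Sequence[Sequence[float]], i: int) -> Tuple[int, int]:
    """Train every candidate for exactly i epochs and pick the best one.

    curves[c][e] is the validation score of candidate c after e + 1 epochs.
    Returns (index of the selected candidate, total epochs consumed).
    """
    if i < 1 or any(len(curve) < i for curve in curves):
        raise ValueError("i must be between 1 and the length of every curve")
    scores = [curve[i - 1] for curve in curves]
    best = max(range(len(curves)), key=scores.__getitem__)
    return best, i * len(curves)


def pareto_front(points: Sequence[Tuple[int, float]]) -> List[Tuple[int, float]]:
    """Keep the non-dominated (epochs consumed, performance) points.

    A point is dominated if another point costs no more epochs and
    reaches at least the same performance.
    """
    front: List[Tuple[int, float]] = []
    # Sort by cost ascending; among equal costs, best performance first.
    for cost, perf in sorted(points, key=lambda p: (p[0], -p[1])):
        if not front or perf > front[-1][1]:
            front.append((cost, perf))
    return front


# Toy, hand-written learning curves (hypothetical values for illustration only).
curves = [
    [0.50, 0.60, 0.65, 0.66],
    [0.40, 0.62, 0.70, 0.75],
    [0.55, 0.58, 0.59, 0.60],
]

points = []
for i in range(1, 5):
    best, cost = i_epoch(curves, i)
    # Assumption: performance is the selected candidate's final-epoch score.
    points.append((cost, curves[best][-1]))

print(pareto_front(points))  # [(3, 0.6), (6, 0.75)]
```

Under these assumptions, the points of another discarding technique could be passed through the same `pareto_front` function, and the technique would add value only where its points are not dominated by the i-Epoch front.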