Hyperparameter optimization is crucial for obtaining peak performance of machine learning models. The standard protocol evaluates various hyperparameter configurations using a resampling estimate of the generalization error to guide optimization and select a final hyperparameter configuration. Although there is little evidence supporting this practice, paired resampling splits, i.e., either a fixed train-validation split or a fixed cross-validation scheme shared across all configurations, are often recommended. We show that, surprisingly, reshuffling the splits for every configuration often improves the final model's generalization performance on unseen data. Our theoretical analysis explains how reshuffling affects the asymptotic behavior of the validation loss surface and provides a bound on the expected regret in the limiting regime. This bound connects the potential benefits of reshuffling to the signal and noise characteristics of the underlying optimization problem. We confirm our theoretical results in a controlled simulation study and demonstrate the practical usefulness of reshuffling in a large-scale, realistic hyperparameter optimization experiment. While reshuffling leads to test performances that are competitive with using fixed splits, it drastically improves results for a single train-validation holdout protocol and can often make holdout competitive with standard CV while being computationally cheaper.
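The difference between the two protocols can be sketched as follows. This is a minimal illustration on a hypothetical synthetic ridge-regression task with random search, not the paper's experimental setup: under the fixed protocol every candidate configuration is scored on the same train-validation split, while under reshuffling each configuration draws a fresh split.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic regression data (for illustration only).
n, d = 200, 20
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
y = X @ beta + rng.normal(scale=2.0, size=n)

def ridge_fit(X_tr, y_tr, alpha):
    # Closed-form ridge regression solution.
    p = X_tr.shape[1]
    return np.linalg.solve(X_tr.T @ X_tr + alpha * np.eye(p), X_tr.T @ y_tr)

def val_loss(alpha, split_rng):
    # Draw a train/validation split from split_rng and return validation MSE.
    idx = split_rng.permutation(n)
    tr, va = idx[:150], idx[150:]
    w = ridge_fit(X[tr], y[tr], alpha)
    resid = y[va] - X[va] @ w
    return float(resid @ resid / len(va))

# Random-search candidates for the regularization strength.
alphas = 10.0 ** rng.uniform(-3, 3, size=30)

# Fixed protocol: a fresh generator with the same seed per call,
# so every configuration is evaluated on the identical split.
fixed_losses = [val_loss(a, np.random.default_rng(42)) for a in alphas]
best_fixed = alphas[int(np.argmin(fixed_losses))]

# Reshuffled protocol: one shared generator, so each configuration
# sees a newly drawn split.
shuffle_rng = np.random.default_rng(1)
resh_losses = [val_loss(a, shuffle_rng) for a in alphas]
best_resh = alphas[int(np.argmin(resh_losses))]

print(best_fixed, best_resh)
```

Under reshuffling, the validation losses of different configurations are no longer evaluated on paired splits, which injects additional noise into the comparison but, as the abstract argues, can improve the generalization of the finally selected configuration.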