Hyperparameter optimization is crucial for obtaining peak performance of machine learning models. The standard protocol evaluates various hyperparameter configurations using a resampling estimate of the generalization error to guide optimization and select a final hyperparameter configuration. Without much evidence, paired resampling splits, i.e., either a fixed train-validation split or a fixed cross-validation scheme, are often recommended. We show that, surprisingly, reshuffling the splits for every configuration often improves the final model's generalization performance on unseen data. Our theoretical analysis explains how reshuffling affects the asymptotic behavior of the validation loss surface and provides a bound on the expected regret in the limiting regime. This bound connects the potential benefits of reshuffling to the signal and noise characteristics of the underlying optimization problem. We confirm our theoretical results in a controlled simulation study and demonstrate the practical usefulness of reshuffling in a large-scale, realistic hyperparameter optimization experiment. While reshuffling leads to test performance that is competitive with using fixed splits, it drastically improves results for a single train-validation holdout protocol and can often make holdout competitive with standard CV while being computationally cheaper.
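The contrast between the two protocols can be sketched in a few lines. The following is a minimal, self-contained illustration (not the paper's experimental code): random search over a ridge penalty on synthetic data, where the fixed protocol evaluates every configuration on one shared holdout split, while the reshuffled protocol draws a fresh split per configuration. All names, the data-generating process, and the 80/20 split ratio are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (assumption: stand-in for a real dataset).
n, d = 200, 5
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
y = X @ beta + rng.normal(scale=0.5, size=n)

def fit_ridge(X_tr, y_tr, lam):
    # Closed-form ridge solution: (X'X + lam*I)^{-1} X'y.
    p = X_tr.shape[1]
    return np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(p), X_tr.T @ y_tr)

def holdout_loss(lam, seed):
    # Validation MSE of one configuration on an 80/20 holdout split
    # determined entirely by `seed`.
    idx = np.random.default_rng(seed).permutation(n)
    tr, va = idx[: int(0.8 * n)], idx[int(0.8 * n):]
    w = fit_ridge(X[tr], y[tr], lam)
    return float(np.mean((X[va] @ w - y[va]) ** 2))

candidates = 10.0 ** np.linspace(-3, 2, 20)

# Fixed protocol: every configuration reuses the SAME split (seed fixed).
best_fixed = min(candidates, key=lambda lam: holdout_loss(lam, seed=42))

# Reshuffled protocol: each configuration gets its OWN fresh split.
best_reshuffled = min(
    ((holdout_loss(lam, seed=i), lam) for i, lam in enumerate(candidates))
)[1]
```

The only difference between the two selections is whether the split seed is held constant across configurations; the abstract's claim is that the second variant, despite its noisier per-configuration estimates, often yields a final model that generalizes at least as well.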