The choice of hyperparameters greatly impacts performance in natural language processing. Often, it is hard to tell whether one method is better than another or just better tuned. Tuning curves resolve this ambiguity by accounting for tuning effort. Specifically, they plot validation performance as a function of the number of hyperparameter choices tried so far. While several estimators exist for these curves, it is common to use point estimates, which we show fail silently and give contradictory results when given too little data. Beyond point estimates, confidence bands are necessary to rigorously establish the relationship between different approaches. We present the first method to construct valid confidence bands for tuning curves. The bands are exact, simultaneous, and distribution-free, so they provide a robust basis for comparing methods. Empirical analysis shows that while bootstrap confidence bands, which serve as a baseline, fail to approximate their target confidence, ours achieve it exactly. We validate our design with ablations, analyze the effect of sample size, and provide guidance on comparing models with our method. To promote confident comparisons in future work, we release a library implementing the method at https://github.com/nalourie/opda.
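To make the idea of a tuning curve concrete, the following is a minimal sketch and not the method or the API of the released library. It assumes the validation scores come from i.i.d. random search. The function names `expected_max_curve` and `dkw_median_band` are hypothetical. The first computes the plug-in point estimate of the expected best score after k trials. The second swaps in the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality to obtain a simultaneous, distribution-free band for the median tuning curve; that band is conservative rather than exact, so it only illustrates the kind of guarantee the abstract describes.

```python
import math


def expected_max_curve(scores, max_k):
    """Plug-in point estimate of E[best of k i.i.d. trials] for k = 1..max_k."""
    ys = sorted(scores)
    n = len(ys)
    curve = []
    for k in range(1, max_k + 1):
        # Under the empirical CDF, P(max of k draws <= ys[i]) = ((i + 1) / n) ** k,
        # so the i-th order statistic carries the difference of consecutive powers.
        est = sum(y * (((i + 1) / n) ** k - (i / n) ** k) for i, y in enumerate(ys))
        curve.append(est)
    return curve


def dkw_median_band(scores, max_k, confidence=0.95,
                    support=(-math.inf, math.inf)):
    """Simultaneous (conservative) band for the median of the best of k trials.

    Assumption: uses the DKW inequality as a stand-in for an exact band.
    `support` gives known bounds on the score, e.g. (0.0, 1.0) for accuracy.
    """
    ys = sorted(scores)
    n = len(ys)
    # DKW: with probability >= confidence, |F_hat(y) - F(y)| <= eps for all y.
    eps = math.sqrt(math.log(2.0 / (1.0 - confidence)) / (2.0 * n))

    def empirical_quantile(p):
        # Smallest y with F_hat(y) >= p, falling back to the support bounds
        # when the shifted level p leaves the range covered by the sample.
        if p <= 0.0:
            return support[0]
        if p > 1.0:
            return support[1]
        return ys[math.ceil(n * p) - 1]

    band = []
    for k in range(1, max_k + 1):
        # The median of the max of k draws is the 0.5 ** (1 / k) quantile of one draw.
        q = 0.5 ** (1.0 / k)
        # The upper CDF bound yields the lower quantile bound, and vice versa.
        band.append((empirical_quantile(q - eps), empirical_quantile(q + eps)))
    return band


if __name__ == "__main__":
    demo_scores = [i / 20 for i in range(1, 21)]  # made-up scores for illustration
    for k, (est, (lo, hi)) in enumerate(
        zip(expected_max_curve(demo_scores, 4), dkw_median_band(demo_scores, 4)),
        start=1,
    ):
        print(f"k={k}: expected max ~ {est:.3f}, median band = [{lo}, {hi}]")
```

With few samples the upper bound quickly reaches the edge of the support as k grows, which illustrates why a point estimate alone can mislead when data is scarce.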