The choice of hyperparameters greatly impacts performance in natural language processing. It is often hard to tell whether one method is better than another or just better tuned. Tuning curves resolve this ambiguity by accounting for tuning effort: they plot validation performance as a function of the number of hyperparameter choices tried so far. Several estimators exist for these curves, but it is common to use point estimates, which we show fail silently and give contradictory results when given too little data. Beyond point estimates, confidence bands are necessary to rigorously establish the relationship between different approaches. We present the first method to construct valid confidence bands for tuning curves. The bands are exact, simultaneous, and distribution-free, so they provide a robust basis for comparing methods. Empirical analysis shows that bootstrap confidence bands, which serve as a baseline, fail to approximate their target confidence, while ours achieve it exactly. We validate our design with ablations, analyze the effect of sample size, and provide guidance on comparing models with our method. To promote confident comparisons in future work, we release opda: an easy-to-use library that you can install with pip. https://github.com/nicholaslourie/opda
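To make the idea of a tuning curve concrete, the following is a minimal sketch in plain NumPy. It is not the opda API and not the band construction proposed in this work. It assumes the hyperparameters are tuned by random search, so the validation scores are i.i.d. draws. The function names (`expected_max_curve`, `median_curve_with_dkw_band`) are hypothetical. The first function computes a plug-in point estimate of the expected best score after k trials. The second bounds the median best score after k trials using the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, a swapped-in technique that is simultaneous and distribution-free but conservative rather than exact.

```python
import numpy as np


def expected_max_curve(scores, max_k):
    """Plug-in estimate of E[best score after k trials], for k = 1..max_k."""
    ys = np.sort(np.asarray(scores, dtype=float))
    n = len(ys)
    ks = np.arange(1, max_k + 1)
    # Empirical CDF evaluated at the order statistics: 0/n, 1/n, ..., n/n.
    cdf = np.arange(n + 1) / n
    # Under the empirical distribution, the max of k draws equals the i-th
    # order statistic with probability (i/n)^k - ((i-1)/n)^k.
    weights = cdf[None, 1:] ** ks[:, None] - cdf[None, :-1] ** ks[:, None]
    return weights @ ys


def median_curve_with_dkw_band(scores, max_k, confidence=0.95):
    """Median tuning curve with a conservative DKW band (illustration only).

    Returns (lower, point, upper), each of length max_k.
    """
    ys = np.sort(np.asarray(scores, dtype=float))
    n = len(ys)
    ks = np.arange(1, max_k + 1)
    # The max of k draws has CDF F(y)^k, so its median is the 0.5^(1/k)
    # quantile of the score distribution.
    ps = 0.5 ** (1.0 / ks)
    # DKW: the empirical CDF is within eps of the true CDF everywhere,
    # with probability at least `confidence`.
    eps = np.sqrt(np.log(2.0 / (1.0 - confidence)) / (2.0 * n))
    # Pad so that out-of-range quantile levels map to -inf / +inf.
    padded = np.concatenate([[-np.inf], ys, [np.inf]])

    def quantile(p):
        # Smallest order statistic whose empirical CDF value is >= p.
        idx = np.ceil(n * p).astype(int)
        return padded[np.clip(idx, 0, n + 1)]

    # Shifting the CDF up by eps lowers the quantile, and vice versa.
    return quantile(ps - eps), quantile(ps), quantile(ps + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.uniform(0.6, 0.9, size=50)  # synthetic validation scores
    lower, point, upper = median_curve_with_dkw_band(scores, max_k=10)
    for k, (lo, mid, hi) in enumerate(zip(lower, point, upper), start=1):
        print(f"k={k:2d}  lower={lo:.3f}  median={mid:.3f}  upper={hi:.3f}")
```

With few samples, the upper bound becomes infinite once 0.5^(1/k) plus the DKW margin exceeds 1. The band then says honestly that the data cannot bound the curve at that k, whereas a point estimate would still return a number.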