Evaluating the in-context learning classification performance of language models poses challenges due to small dataset sizes, extensive prompt selection using the validation set, and intentionally difficult tasks that lead to near-random performance. The standard random baseline -- the expected accuracy of guessing labels uniformly at random -- is stable when the evaluation set is used only once or when the dataset is large. We account for the common practice of validation set reuse and existing small datasets with a stronger random baseline: the expected maximum accuracy across multiple random classifiers. When choosing the best prompt demonstrations across six quantized language models applied to 16 BIG-bench Lite tasks, more than 20% of the few-shot results that exceed the standard baseline do not exceed this stronger random baseline. When held-out test sets are available, this stronger baseline is also a better predictor of held-out performance than the standard baseline, avoiding unnecessary test set evaluations. This maximum random baseline provides an easily calculated drop-in replacement for the standard baseline.
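As an illustration of how the expected maximum accuracy across multiple random classifiers might be computed, the following is a minimal sketch rather than the authors' implementation. It assumes that each of t random classifiers guesses independently, that each guess is correct with probability p (for example, p = 1/k for k balanced labels), and that the number of correct guesses on n examples is therefore binomial. The function names `binom_cdf` and `max_random_baseline` are hypothetical, chosen here for illustration.

```python
from math import comb


def binom_cdf(n, p):
    """Return [P(X <= k) for k in 0..n] where X ~ Binomial(n, p)."""
    cdf, total = [], 0.0
    for k in range(n + 1):
        # Exact binomial pmf; suitable for small n (very large n can overflow floats).
        total += comb(n, k) * p**k * (1 - p) ** (n - k)
        cdf.append(min(total, 1.0))  # guard against rounding slightly above 1
    return cdf


def max_random_baseline(n, p, t):
    """Expected best accuracy among t independent random classifiers.

    Assumption (not taken from the source text): the maximum of t i.i.d.
    binomial counts has CDF F(k)**t, so P(max = k) = F(k)**t - F(k-1)**t.
    """
    cdf = binom_cdf(n, p)
    expected, prev = 0.0, 0.0
    for k in range(n + 1):
        cur = cdf[k] ** t
        expected += (k / n) * (cur - prev)
        prev = cur
    return expected


if __name__ == "__main__":
    # Binary task, 100 validation examples: one guess versus the best of 10.
    print(max_random_baseline(100, 0.5, 1))
    print(max_random_baseline(100, 0.5, 10))
```

Under these assumptions, setting t = 1 recovers the standard random baseline p, and the value grows with t and shrinks toward p as n increases, which matches the abstract's point that the standard baseline is adequate only for single-use or large evaluation sets.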