Large language models (LLMs) are stochastic: not all models give deterministic answers, even when the temperature is set to zero and the random seed is fixed. However, few benchmark studies attempt to quantify this uncertainty, partly because of the time and cost of repeated experiments. We use benchmarks designed to test LLMs' capacity to reason about cardinal directions to explore the impact of experimental repeats on the mean score and prediction interval. We suggest a simple method for cost-effectively quantifying the uncertainty of a benchmark score and make recommendations concerning reproducible LLM evaluation.
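For illustration only, since the abstract does not specify the authors' exact procedure: a minimal sketch of one common way to turn a small number of repeated benchmark runs into a mean score and a t-based prediction interval for a single future run, assuming approximately normally distributed scores. The function name and the example scores are hypothetical.

```python
import numpy as np
from scipy import stats

def score_prediction_interval(scores, alpha=0.05):
    """Mean and (1 - alpha) prediction interval for a single future
    benchmark run, estimated from n repeated runs (assumes roughly
    normal scores)."""
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    mean = scores.mean()
    sd = scores.std(ddof=1)                    # sample standard deviation
    t_crit = stats.t.ppf(1 - alpha / 2, n - 1) # two-sided t critical value
    half_width = t_crit * sd * np.sqrt(1 + 1 / n)
    return mean, (mean - half_width, mean + half_width)

# Hypothetical scores (fraction correct) from five repeated runs
mean, (lo, hi) = score_prediction_interval([0.72, 0.70, 0.75, 0.69, 0.73])
print(f"mean = {mean:.3f}, 95% prediction interval = ({lo:.3f}, {hi:.3f})")
```

Note that a prediction interval (for the score of one further run) is wider than a confidence interval for the mean, which is why the half-width carries the extra factor sqrt(1 + 1/n).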