In this study, we empirically investigate the effect of sampling temperature on the performance of Large Language Models (LLMs) across a range of problem-solving tasks. We created a multiple-choice question-answering (MCQA) exam by randomly sampling problems from standard LLM benchmarks. We then used four popular LLMs with five prompt-engineering techniques to solve the MCQA problems while increasing the sampling temperature from 0.0 to 1.0. Despite anecdotal reports to the contrary, our empirical results indicate that changes in temperature in the range 0.0 to 1.0 do not have a statistically significant impact on LLM performance for problem-solving tasks. Moreover, these results appear to hold regardless of the LLM, the prompt-engineering technique, or the problem domain. All code, data, and supplemental materials are available on GitHub at: https://github.com/matthewrenze/jhu-llm-temperature.
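For readers unfamiliar with the parameter under study: sampling temperature rescales a model's output logits before the softmax, so lower values concentrate probability on the top token and higher values flatten the distribution. The sketch below illustrates the standard mechanism (it is not taken from the paper's codebase; the function name and the greedy treatment of T = 0 are illustrative assumptions matching common LLM API behavior):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Sample a token index from logits after temperature scaling.

    As temperature approaches 0 this becomes greedy (argmax) decoding;
    higher temperatures flatten the distribution, increasing randomness.
    """
    if temperature <= 0.0:
        # Most LLM APIs treat temperature 0 as deterministic/greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [logit / temperature for logit in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling from the temperature-scaled distribution.
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point round-off
```

At temperature 0.0 the highest-logit token is always returned, while at 1.0 tokens are drawn in proportion to their unscaled softmax probabilities; the study above varies this parameter across that range.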