Current evaluations of large language models (LLMs) often overlook non-determinism, typically considering only a single output per example. This limits our understanding of LLM performance variability in real-world applications. Our study addresses this gap by exploring key questions: the performance gap between greedy decoding and sampling, the consistency of benchmarks under non-determinism, and unique model behaviors. Through extensive experiments, we observe that greedy decoding generally outperforms sampling on most evaluated tasks. We also find performance to be consistent across different LLM sizes and alignment methods, noting that alignment can reduce sampling variance. Moreover, our best-of-N sampling approach demonstrates that smaller LLMs can match or surpass larger models such as GPT-4-Turbo, highlighting the untapped potential of smaller LLMs. This research underscores the importance of accounting for non-determinism in LLM evaluations and provides insights for future LLM development and evaluation.
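As a minimal sketch of the best-of-N idea (not the paper's implementation): draw N sampled outputs from a model and keep the one ranked highest by a scoring function. The `generate` and `score` callables below are hypothetical stand-ins for a real sampled model call and a real quality metric.

```python
import random

def best_of_n(generate, score, n=8, seed=0):
    """Sample n candidate outputs and return the highest-scoring one."""
    rng = random.Random(seed)  # fixed seed for reproducibility of the toy demo
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)

# Hypothetical stand-ins: a "model" that samples noisy answers to 2 + 2,
# and a scorer that rewards the correct answer.
def toy_generate(rng):
    return rng.choice(["3", "4", "5"])

def toy_score(answer):
    return 1.0 if answer == "4" else 0.0

if __name__ == "__main__":
    print(best_of_n(toy_generate, toy_score, n=16))
```

With a sufficiently discriminative scorer, increasing N raises the chance that at least one sampled candidate is strong, which is the mechanism by which sampling can close the gap to larger models in the study's comparison.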