As LLMs evolve rapidly, evaluation plays an increasingly important role in understanding and advancing these models. Prior evaluations have shown that scaling, training type, architecture, and other factors strongly affect LLM performance. However, the extent and nature of these effects remain debated, because most assessments have covered only a limited number of models and data points. A statistical perspective can clarify how these factors affect performance scores. Our study thoroughly re-examines these LLMs, targeting the inadequacies of current evaluation methods. With the advent of a uniform evaluation framework, our research leverages an expansive dataset of evaluation results and introduces a comprehensive statistical methodology. This includes analysis of variance (ANOVA), Tukey HSD tests, generalized additive mixed models (GAMM), and clustering techniques, offering a robust and transparent approach to interpreting LLM performance data. Contrary to prevailing findings, our results challenge assumptions about emergent abilities and about the influence of particular training types and architectures in LLMs. These findings offer new perspectives on the characteristics, intrinsic nature, and developmental trajectories of LLMs. By providing straightforward and reliable methods to scrutinize and reassess LLM performance data, this study contributes a nuanced perspective on LLM efficiency and potential.
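To make the first of the listed statistical tools concrete, the following is a minimal sketch of a one-way ANOVA F statistic, which tests whether mean scores differ across groups such as training types. It is an illustration only, not the study's actual pipeline: the function name `one_way_anova_f`, the group labels, and the scores are all hypothetical, and the numbers are made up for the example.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` maps a group label to a list of numeric scores.
    Hypothetical helper written for illustration only.
    """
    all_scores = [s for scores in groups.values() for s in scores]
    grand_mean = mean(all_scores)

    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(
        len(scores) * (mean(scores) - grand_mean) ** 2
        for scores in groups.values()
    )

    # Within-group sum of squares: spread of scores around their own group mean.
    ss_within = 0.0
    for scores in groups.values():
        group_mean = mean(scores)
        ss_within += sum((s - group_mean) ** 2 for s in scores)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up scores grouped by an assumed "training type" factor.
scores_by_training_type = {
    "pretrained": [1.0, 2.0, 3.0],
    "fine-tuned": [2.0, 3.0, 4.0],
    "instruction-tuned": [3.0, 4.0, 5.0],
}

f_stat, df_between, df_within = one_way_anova_f(scores_by_training_type)
print(f_stat, df_between, df_within)
```

A large F relative to the F distribution with `(df_between, df_within)` degrees of freedom suggests that at least one group mean differs. A post-hoc test such as Tukey HSD would then indicate which pairs of groups differ.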