This study identifies a general intelligence factor, g, in language models, extending a psychometric theory traditionally applied to humans and certain animal species. Applying factor analysis to two large datasets, the Open LLM Leaderboard (1,232 models) and the General Language Understanding Evaluation (GLUE) Leaderboard (88 models), we find compelling evidence for a unidimensional, highly stable g factor that accounts for 85% of the variance in model performance. We also find a moderate correlation of .48 between model size and g. The presence of g in language models offers a unified metric for model evaluation and opens new avenues for more robust, g-based assessment of model ability. These findings lay a foundation for understanding artificial general intelligence from a psychometric perspective, inform future research on it, and have practical implications for model evaluation and development.
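To make the notion of a single factor "accounting for" a share of variance concrete, the following is a minimal sketch on synthetic data. It is not the study's pipeline: it uses the first principal component of the benchmark correlation matrix as a simple stand-in for a one-factor model, and the function name `first_factor`, the synthetic loadings of 0.9, and the sample sizes are all assumptions chosen for illustration.

```python
import numpy as np


def first_factor(scores):
    """Approximate a single general factor from a (n_models, n_benchmarks) score matrix.

    Uses the leading eigenvector of the correlation matrix (a principal-component
    approximation, not a full factor-analysis estimator).
    Returns (loadings, share_of_variance, g_scores).
    """
    # Standardize each benchmark column so all benchmarks are on the same scale.
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    corr = np.corrcoef(z, rowvar=False)

    # eigh returns eigenvalues in ascending order for a symmetric matrix.
    eigvals, eigvecs = np.linalg.eigh(corr)
    top_val, top_vec = eigvals[-1], eigvecs[:, -1]

    # Eigenvector sign is arbitrary; orient it so loadings are mostly positive.
    if top_vec.sum() < 0:
        top_vec = -top_vec

    share = top_val / eigvals.sum()          # fraction of total variance
    loadings = top_vec * np.sqrt(top_val)    # correlation of each benchmark with the component
    g_scores = z @ top_vec / np.sqrt(top_val)  # roughly unit-variance score per model
    return loadings, share, g_scores


# Synthetic example (hypothetical data): one latent ability drives six benchmarks.
rng = np.random.default_rng(0)
n_models, n_benchmarks = 200, 6
latent = rng.normal(size=n_models)
true_loadings = np.full(n_benchmarks, 0.9)
noise = rng.normal(size=(n_models, n_benchmarks)) * np.sqrt(1 - true_loadings**2)
scores = latent[:, None] * true_loadings + noise

loadings, share, g_scores = first_factor(scores)
recovery = np.corrcoef(g_scores, latent)[0, 1]
print(f"share of variance on first component: {share:.2f}")
print(f"correlation between estimated and latent factor: {recovery:.2f}")
```

With real leaderboard data, `scores` would hold one row per model and one column per benchmark. A dedicated factor-analysis estimator would generally be preferable to this principal-component shortcut when the goal is to test unidimensionality rather than merely summarize it.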