This study identifies a factor of general intelligence, or g, in language models, extending a psychometric theory traditionally applied to humans and certain animal species. Applying factor analysis to two large datasets (the Open LLM Leaderboard with 1,232 models and the General Language Understanding Evaluation (GLUE) Leaderboard with 88 models), we find compelling evidence for a unidimensional, highly stable g factor that accounts for 85% of the variance in model performance. We also find a moderate correlation of .49 between model size and g. The discovery of g in language models offers a unified metric for model evaluation and opens new avenues for more robust, g-based assessment of model ability. These findings lay a foundation for understanding and studying artificial general intelligence from a psychometric perspective, and they have practical implications for model evaluation and development.
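To make the idea of "variance accounted for by a single factor" concrete, here is a minimal sketch under stated assumptions. It is not the study's procedure: it uses the first principal component of the benchmark correlation matrix as a simple stand-in for a fitted one-factor model, and it runs on synthetic scores rather than leaderboard data. The function name `first_factor_share`, the number of benchmarks, and the loading of 0.9 are all hypothetical choices made for illustration.

```python
import numpy as np


def first_factor_share(scores: np.ndarray) -> tuple[float, np.ndarray]:
    """Return (share of variance on the first principal component, its loadings).

    scores: array of shape (n_models, n_benchmarks).
    Note: this is a principal-component approximation, not a full factor analysis.
    """
    # Correlation matrix across benchmarks (columns are the variables).
    corr = np.corrcoef(scores, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(corr)
    top_val = eigvals[-1]
    top_vec = eigvecs[:, -1]
    # Eigenvector sign is arbitrary; orient it so the loadings sum to a positive value.
    if top_vec.sum() < 0:
        top_vec = -top_vec
    loadings = top_vec * np.sqrt(top_val)
    share = top_val / eigvals.sum()
    return float(share), loadings


# Synthetic illustration only (assumed data, not leaderboard results):
# every benchmark score is driven by one latent ability plus independent noise.
rng = np.random.default_rng(0)
n_models, n_benchmarks = 500, 6
latent = rng.normal(size=(n_models, 1))
true_loadings = np.full((1, n_benchmarks), 0.9)
noise = rng.normal(size=(n_models, n_benchmarks)) * np.sqrt(1 - 0.9**2)
scores = latent @ true_loadings + noise

share, loadings = first_factor_share(scores)
print(f"share of variance on first component: {share:.2f}")
print("loadings:", np.round(loadings, 2))
```

When one component carries most of the variance and every benchmark loads on it with the same sign, the data are consistent with a single dominant dimension. A proper factor analysis would additionally model benchmark-specific variance and test whether further factors are needed.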