The advancement of large language models (LLMs) has outpaced traditional evaluation methodologies. This progress presents novel challenges, such as measuring human-like psychological constructs, moving beyond static and task-specific benchmarks, and establishing human-centered evaluation. These challenges intersect with psychometrics, the science of quantifying the intangible aspects of human psychology, such as personality, values, and intelligence. This review paper introduces and synthesizes the emerging interdisciplinary field of LLM Psychometrics, which leverages psychometric instruments, theories, and principles to evaluate, understand, and enhance LLMs. The reviewed literature systematically shapes benchmarking principles, broadens evaluation scopes, refines methodologies, validates results, and advances LLM capabilities. Diverse perspectives are integrated to provide a structured framework for researchers across disciplines, enabling a more comprehensive understanding of this nascent field. Ultimately, the review provides actionable insights for developing future evaluation paradigms that align with human-level AI and promote the advancement of human-centered AI systems for societal benefit. A curated repository of LLM psychometric resources is available at https://github.com/valuebyte-ai/Awesome-LLM-Psychometrics.