Large Language Models (LLMs) have demonstrated exceptional capabilities in solving various tasks, progressively evolving into general-purpose assistants. The increasing integration of LLMs into society has sparked interest in whether they exhibit psychological patterns, and whether these patterns remain consistent across different contexts -- questions that could deepen the understanding of their behaviors. Inspired by psychometrics, this paper presents a comprehensive benchmark for quantifying psychological constructs of LLMs, encompassing psychological dimension identification, assessment dataset design, and assessment with results validation. Our work identifies five key psychological constructs -- personality, values, emotional intelligence, theory of mind, and self-efficacy -- assessed through a suite of 13 datasets featuring diverse scenarios and item types. We uncover significant discrepancies between LLMs' self-reported traits and their response patterns in real-world scenarios, revealing complexities in their behaviors. Our findings also show that some preference-based tests, originally designed for humans, cannot elicit reliable responses from LLMs. This paper offers a thorough psychometric assessment of LLMs, providing insights into reliable evaluation and potential applications in AI and social sciences.