Psychometric tests are increasingly used to assess psychological constructs in large language models (LLMs). However, it remains unclear whether these tests -- originally developed for humans -- yield meaningful results when applied to LLMs. In this study, we systematically evaluate the reliability and validity of human psychometric tests on 17 LLMs for three constructs: sexism, racism, and morality. We find moderate reliability across multiple item and prompt variations. Validity is evaluated through both convergent (i.e., testing theory-based inter-test correlations) and ecological approaches (i.e., testing the alignment between test scores and behavior in real-world downstream tasks). Crucially, we find that psychometric test scores do not align with, and in some cases even negatively correlate with, model behavior in downstream tasks, indicating low ecological validity. Our results highlight that systematic evaluations of psychometric tests on LLMs are essential before interpreting their scores. Our findings also suggest that psychometric tests designed for humans cannot be applied directly to LLMs without adaptation.