Psychometric tests are increasingly used to assess psychological constructs in large language models (LLMs). However, it remains unclear whether these tests -- originally developed for humans -- yield meaningful results when applied to LLMs. In this study, we systematically evaluate the reliability and validity of human psychometric tests for three constructs: sexism, racism, and morality. We find moderate reliability across multiple item and prompt variations. Validity is evaluated through both convergent (i.e., testing theory-based inter-test correlations) and ecological approaches (i.e., testing the alignment between test scores and behavior in real-world downstream tasks). Crucially, we find that psychometric test scores do not align with, and in some cases even negatively correlate with, model behavior in downstream tasks, indicating low ecological validity. Our results highlight that systematic evaluation of psychometric tests is essential before interpreting their scores. They also suggest that psychometric tests designed for humans cannot be applied directly to LLMs without adaptation.