Psychometric tests are increasingly used to assess psychological constructs in large language models (LLMs). However, it remains unclear whether these tests -- originally developed for humans -- yield meaningful results when applied to LLMs. In this study, we systematically evaluate the reliability and validity of human psychometric tests for three constructs: sexism, racism, and morality. We find moderate reliability across multiple item and prompt variations. Validity is evaluated through both convergent (i.e., testing theory-based inter-test correlations) and ecological approaches (i.e., testing the alignment between test scores and behavior in real-world downstream tasks). Crucially, we find that psychometric test scores do not align with, and in some cases even negatively correlate with, model behavior in downstream tasks, indicating low ecological validity. Our results highlight that systematic evaluation of psychometric tests is essential before interpreting their scores. They also suggest that psychometric tests designed for humans cannot be applied directly to LLMs without adaptation.