You don't need a personality test to know these models are unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments

The versatility of Large Language Models (LLMs) on natural language understanding tasks has made them popular for research in the social sciences. In particular, to understand the properties and innate personas of LLMs, researchers have run studies that prompt LLMs with questions asking for their opinions on particular topics. In this study, we take a cautionary step back and examine whether the current format of prompting enables LLMs to provide responses in a consistent and robust manner. We first construct a dataset that contains 693 questions encompassing 39 different instruments of persona measurement on 115 persona axes. Additionally, we design a set of prompts containing minor variations to examine LLMs' ability to generate accurate answers, as well as consistency variations to examine whether their answers stay consistent under simple perturbations such as switching the option order. Our experiments on 15 different open-source LLMs reveal that even simple perturbations are sufficient to significantly degrade a model's question-answering ability, and that most LLMs have low negation consistency. Our results suggest that the currently widespread practice of prompting is insufficient to accurately capture model perceptions, and we discuss potential alternatives to address these issues.
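As an illustration of the option-order perturbation mentioned above, the following is a minimal sketch and not the authors' evaluation code. The prompt template, the function names `build_prompt`, `parse_choice`, and `option_order_consistent`, and the `ask_model` callable (any function that maps a prompt string to the model's text reply) are all assumptions made for this example. The check asks the same question twice, once with the options reversed, and reports whether the model selects the same option text both times.

```python
from typing import Callable, Optional, Sequence

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def build_prompt(question: str, options: Sequence[str]) -> str:
    """Format a multiple-choice prompt with lettered options (hypothetical template)."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def parse_choice(reply: str, options: Sequence[str]) -> Optional[str]:
    """Map a reply such as 'B' or 'b.' to the option text, or None if unparseable."""
    reply = reply.strip().upper()
    if not reply:
        return None
    idx = LETTERS.find(reply[0])
    if idx < 0 or idx >= len(options):
        return None
    return options[idx]


def option_order_consistent(
    question: str,
    options: Sequence[str],
    ask_model: Callable[[str], str],
) -> bool:
    """Return True if the model picks the same option text before and after
    the option order is reversed. Unparseable replies count as inconsistent."""
    original = list(options)
    swapped = list(reversed(original))
    first = parse_choice(ask_model(build_prompt(question, original)), original)
    second = parse_choice(ask_model(build_prompt(question, swapped)), swapped)
    return first is not None and first == second


# Two stand-in "models" for demonstration only (no real LLM is called).
def always_a(prompt: str) -> str:
    # Position-biased: always answers with the first letter.
    return "A"


def prefers_agree(prompt: str) -> str:
    # Content-driven: finds whichever letter is attached to "Agree".
    for line in prompt.splitlines():
        if line.endswith(". Agree"):
            return line[0]
    return ""


question = "I enjoy meeting new people."
options = ["Agree", "Disagree"]
position_biased_result = option_order_consistent(question, options, always_a)
content_driven_result = option_order_consistent(question, options, prefers_agree)
```

In this sketch, a position-biased stand-in that always answers "A" is flagged as inconsistent, while a stand-in that tracks the option's content is not. Averaging this boolean over many questions would give one possible consistency rate. The paper may define its metrics differently.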

