As large language models (LLMs) grow more capable, several recent studies have tried to quantify their behavior using psychological tools created to study human behavior. One example is the measurement of the "personality" of LLMs using personality self-assessment tests. In this paper, we take three such studies and use the prompts from each of them to measure the personality of the same LLM. We find that the three prompts lead to very different personality scores. This simple test reveals that personality self-assessment scores in LLMs depend on the subjective choice of the prompter. Because there is no correct answer to such questions, the ground-truth personality scores of an LLM are unknown, and there is no way to claim that one prompt is more or less correct than another. We then introduce the property of option-order symmetry for personality measurement of LLMs. Since most self-assessment tests take the form of multiple-choice questions (MCQs), we argue that the scores should be robust not only to the prompt template but also to the order in which the options are presented. This test unsurprisingly reveals that the answers to the self-assessment tests are not robust to the order of the options. These simple tests, done on ChatGPT and Llama2 models, show that self-assessment personality tests created for humans are not appropriate for measuring personality in LLMs.
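The option-order symmetry property can be illustrated with a minimal sketch: ask the same item twice, once with the options in their original order and once reversed, and check whether the model selects the same option text both times. Everything named here is an assumption made for illustration and is not taken from the studies discussed: the `OPTIONS` wording, the prompt layout in `build_prompt`, the example statement, and the `respond` callable, which stands in for whatever function sends a prompt to an LLM and returns its reply.

```python
import string
from typing import Callable, Sequence

# Hypothetical Likert-style options; the wording used in each study may differ.
OPTIONS = [
    "Very accurate",
    "Moderately accurate",
    "Neither accurate nor inaccurate",
    "Moderately inaccurate",
    "Very inaccurate",
]


def build_prompt(statement: str, options: Sequence[str]) -> str:
    """Format one self-assessment item as a lettered multiple-choice question."""
    lines = [f'Statement: "{statement}"', "Options:"]
    for label, text in zip(string.ascii_uppercase, options):
        lines.append(f"({label}) {text}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def chosen_option(
    respond: Callable[[str], str], statement: str, options: Sequence[str]
) -> str:
    """Return the option text (not the letter) that the model selected."""
    label = respond(build_prompt(statement, options)).strip().upper()[:1]
    return options[string.ascii_uppercase.index(label)]


def is_order_symmetric(
    respond: Callable[[str], str],
    statement: str,
    options: Sequence[str] = OPTIONS,
) -> bool:
    """True if the same option text is chosen in original and reversed order."""
    forward = chosen_option(respond, statement, list(options))
    backward = chosen_option(respond, statement, list(reversed(options)))
    return forward == backward


# Two deterministic stand-ins for an LLM, used only to exercise the check.
def position_biased(prompt: str) -> str:
    """Always answers with the first letter, whatever it is attached to."""
    return "A"


def content_consistent(prompt: str) -> str:
    """Answers with whichever letter is attached to 'Very accurate'."""
    for line in prompt.splitlines():
        if line[4:] == "Very accurate":
            return line[1]
    raise ValueError("option not found in prompt")
```

With an illustrative item such as "I am the life of the party.", `is_order_symmetric(content_consistent, ...)` holds while `is_order_symmetric(position_biased, ...)` does not, because the position-biased stand-in follows the letter rather than the content. Comparing option text instead of letters is the design choice that makes the check meaningful: the letter "A" refers to different content in the two orderings.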