Despite the many use cases for large language models (LLMs) in creating personalized chatbots, there has been limited research on evaluating the extent to which the behaviors of personalized LLMs accurately and consistently reflect specific personality traits. We study the behavior of LLM-based agents, which we refer to as LLM personas, and present a case study with ChatGPT and GPT-4. The study investigates whether LLMs can generate content that aligns with their assigned personality profiles. To this end, we create distinct LLM personas based on the Big Five personality model, have them complete the 44-item Big Five Inventory (BFI) personality test and a story writing task, and then assess their stories with automatic and human evaluations. Results show that LLM personas' self-reported BFI scores are consistent with their designated personality types, with large effect sizes observed across all five traits. Additionally, the assigned personality types correlate significantly with certain psycholinguistic features of the personas' writing, as measured by the Linguistic Inquiry and Word Count (LIWC) tool. Interestingly, human evaluators perceive the stories as less personal when told that the stories are authored by AI. However, their judgments on other aspects of the writing, such as readability, cohesiveness, redundancy, likeability, and believability, remain largely unaffected. Notably, when evaluators are informed of the AI authorship, their accuracy in identifying the intended personality traits from the stories decreases by more than 10% for some traits. This research marks a significant step forward in understanding the capabilities of LLMs to express personality traits.