Large Language Models (LLMs) have recently demonstrated remarkable capabilities, not only in natural language processing tasks but also across diverse domains such as clinical medicine, legal consultation, and education. LLMs have become more than mere applications, evolving into assistants capable of addressing diverse user requests. This narrows the distinction between human beings and artificial intelligence agents, raising intriguing questions about whether personalities, temperaments, and emotions can manifest within LLMs. In this paper, we propose PsychoBench, a framework for evaluating diverse psychological aspects of LLMs. PsychoBench comprises thirteen scales commonly used in clinical psychology and classifies them into four categories: personality traits, interpersonal relationships, motivational tests, and emotional abilities. Our study examines five popular models: \texttt{text-davinci-003}, ChatGPT, GPT-4, LLaMA-2-7b, and LLaMA-2-13b. Additionally, we employ a jailbreak approach to bypass safety alignment protocols and test the intrinsic natures of LLMs. PsychoBench is openly accessible at \url{https://github.com/CUHK-ARISE/PsychoBench}.