Personality psychologists have long studied the relationship between personality and safety behaviors in human society. Although Large Language Models (LLMs) exhibit personality traits, the relationship between those traits and safety abilities in LLMs remains unexplored. In this paper, using the reliable MBTI-M scale, we find that LLMs' personality traits are closely related to their safety abilities, i.e., toxicity, privacy, and fairness. Moreover, safety alignment generally strengthens various LLMs' Extraversion, Sensing, and Judging traits. Guided by these findings, we can edit LLMs' personality traits to improve their safety performance; for example, shifting a model's personality from ISTJ to ISTP yields relative improvements of approximately 43% in privacy and 10% in fairness performance. We also find that LLMs with different personality traits differ in their susceptibility to jailbreak attacks. This study pioneers the investigation of LLM safety from a personality perspective, providing new insights into LLM safety enhancement.