The manipulation of the personality traits of large language models (LLMs) has emerged as a key area of research. Methods such as prompt-based In-Context Knowledge Editing (IKE) and gradient-based Model Editor Networks (MEND) have been explored, but each has drawbacks: IKE is prompt-dependent, making its results variable and sensitive to phrasing, while MEND yields inconsistent and sometimes unintelligible outputs. To address this, we employ Opinion QA-based Parameter-Efficient Fine-Tuning (PEFT), specifically Quantized Low-Rank Adaptation (QLoRA), to manipulate the Big Five personality traits: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. After PEFT, models such as Mistral-7B-Instruct and LLaMA-2-7B-chat began generating emojis, even though no emojis were present in the PEFT data. For instance, LLaMA-2-7B-chat generated emojis in 99.5% of extraversion-related test instances, while Mistral-7B-Instruct did so in 92.5% of openness-related test instances. In-Context Learning (ICL) explainability analysis indicated that the LLMs used emojis intentionally to express these traits. Mechanistic Interpretability analysis showed that this latent behaviour could be traced to specific neurons that became activated or amplified after PEFT. This paper makes several novel contributions: first, we introduce an Opinion QA dataset for PEFT-driven personality manipulation; second, we develop metric models to benchmark LLM personality traits; third, we demonstrate PEFT's superiority over IKE for personality manipulation; and finally, we analyse and validate the observed emoji usage through explainability methods, namely Mechanistic Interpretability and In-Context Learning explainability.
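The reported emoji rates (e.g. 99.5% of extraversion-related test instances) can be reproduced with a simple per-response detector. The sketch below is illustrative only; the helper names are hypothetical and the paper's exact metric models are not shown here. It flags a response as emoji-bearing if any character falls in the common Unicode emoji blocks, then computes the fraction of flagged responses.

```python
def contains_emoji(text: str) -> bool:
    # Flag characters in the common emoji blocks:
    # U+1F300..U+1FAFF (emoticons, symbols & pictographs) and
    # U+2600..U+27BF (misc symbols, dingbats).
    return any(
        0x1F300 <= ord(ch) <= 0x1FAFF or 0x2600 <= ord(ch) <= 0x27BF
        for ch in text
    )

def emoji_rate(responses: list[str]) -> float:
    """Fraction of model responses containing at least one emoji."""
    if not responses:
        return 0.0
    return sum(contains_emoji(r) for r in responses) / len(responses)

# Hypothetical post-PEFT responses: three of four contain emojis.
outputs = [
    "I love meeting new people! 😄",
    "Let's go explore together! 🎉",
    "Sounds great.",
    "Adventure awaits 🌍",
]
print(emoji_rate(outputs))  # 0.75
```

Applied to the full set of trait-specific test instances, `emoji_rate` yields percentages directly comparable to those quoted above.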