The manipulation of the personality traits of large language models (LLMs) has emerged as a key area of research. Methods such as prompt-based In-Context Knowledge Editing (IKE) and gradient-based Model Editor Networks (MEND) have been explored, but each has drawbacks: IKE is prompt-dependent, making its results variable and sensitive to phrasing, while MEND yields inconsistent and sometimes unintelligible outputs. To address this, we employ Opinion QA-based Parameter-Efficient Fine-Tuning (PEFT), specifically Quantized Low-Rank Adaptation (QLoRA), to manipulate the Big Five personality traits: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. After PEFT, models such as Mistral-7B-Instruct and LLaMA-2-7B-chat began generating emojis, even though no emojis were present in the PEFT data. For instance, LLaMA-2-7B-chat generated emojis in 99.5% of extraversion-related test instances, while Mistral-7B-Instruct did so in 92.5% of openness-related test instances. In-Context Learning (ICL) explainability analysis indicated that the LLMs used emojis intentionally to express these traits. Mechanistic Interpretability analysis showed that this latent behaviour could be traced to specific neurons that became activated or amplified after PEFT. This paper makes several novel contributions: first, we introduce an Opinion QA dataset for PEFT-driven personality manipulation; second, we develop metric models to benchmark LLM personality traits; third, we demonstrate PEFT's superiority over IKE for personality manipulation; and finally, we analyse and validate the observed emoji usage through explainability methods, namely Mechanistic Interpretability and In-Context Learning explainability.
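The reported emoji rates (e.g. 99.5% of extraversion-related test instances) can be reproduced with a simple per-response detector. The sketch below is illustrative only; the helper names are hypothetical and the paper's exact metric models are not shown here. It flags a response as emoji-bearing if any character falls in the common Unicode emoji blocks, then computes the fraction of flagged responses.

```python
def contains_emoji(text: str) -> bool:
    # Flag characters in the common emoji blocks:
    # U+1F300..U+1FAFF (emoticons, symbols & pictographs) and
    # U+2600..U+27BF (misc symbols, dingbats).
    return any(
        0x1F300 <= ord(ch) <= 0x1FAFF or 0x2600 <= ord(ch) <= 0x27BF
        for ch in text
    )

def emoji_rate(responses: list[str]) -> float:
    """Fraction of model responses containing at least one emoji."""
    if not responses:
        return 0.0
    return sum(contains_emoji(r) for r in responses) / len(responses)

# Hypothetical post-PEFT responses: three of four contain emojis.
outputs = [
    "I love meeting new people! 😄",
    "Let's go explore together! 🎉",
    "Sounds great.",
    "Adventure awaits 🌍",
]
print(emoji_rate(outputs))  # 0.75
```

Applied to the full set of trait-specific test instances, `emoji_rate` yields percentages directly comparable to those quoted above.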