Large Language Models (LLMs) demonstrate increasing conversational fluency, yet instilling them with nuanced, human-like emotional expression remains a significant challenge. Current alignment techniques often address only surface-level output or require extensive fine-tuning. This paper demonstrates that targeted activation engineering can steer LLaMA 3.1-8B to exhibit more human-like emotional nuance. We first employ attribution patching to identify causally influential components, locating a key intervention site by observing activation patterns during diagnostic conversational tasks. We then derive emotional expression vectors from the difference in activations generated by contrastive text pairs (positive vs. negative examples of target emotions). Applying these vectors to new conversational prompts significantly enhances emotional characteristics: steered responses show increased positive sentiment (e.g., joy, trust) and more frequent first-person pronoun usage, indicative of greater personal engagement. Our findings offer a precise, interpretable framework for steering emotional expression and open new directions for the study of conversational AI.
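The contrastive derivation described above can be sketched as follows. This is a minimal NumPy illustration with synthetic activations: the shapes, the `emotion_vector` and `steer` helpers, and the scaling coefficient `alpha` are all hypothetical stand-ins, not the paper's actual implementation, which operates on LLaMA 3.1-8B hidden states at a layer identified via attribution patching.

```python
import numpy as np

# Illustrative setup: n contrastive examples, d-dimensional hidden
# activations collected at the chosen intervention layer.
rng = np.random.default_rng(0)
d = 16

# Synthetic stand-ins for activations from contrastive prompt pairs:
# pos_acts from positive expressions of the target emotion,
# neg_acts from negative/neutral counterparts.
pos_acts = rng.normal(loc=0.5, scale=1.0, size=(8, d))
neg_acts = rng.normal(loc=-0.5, scale=1.0, size=(8, d))

def emotion_vector(pos, neg):
    """Difference-of-means steering vector, unit-normalized."""
    v = pos.mean(axis=0) - neg.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden, v, alpha=4.0):
    """Add the scaled steering vector to a hidden state at generation time."""
    return hidden + alpha * v

v = emotion_vector(pos_acts, neg_acts)
steered = steer(rng.normal(size=d), v)
```

In practice the addition in `steer` would be applied inside the model's forward pass (e.g., via a forward hook on the identified layer) at every generation step, with the strength `alpha` tuned to trade off emotional expressiveness against fluency.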