Current methods for personality control in Large Language Models rely on static prompting or expensive fine-tuning, failing to capture the dynamic and compositional nature of human traits. We introduce PERSONA, a training-free framework that achieves fine-tuning-level performance through direct manipulation of personality vectors in activation space. Our key insight is that personality traits appear as extractable, approximately orthogonal directions in the model's representation space that support algebraic operations. The framework operates in three stages: Persona-Base extracts orthogonal trait vectors via contrastive activation analysis; Persona-Algebra enables precise control through vector arithmetic (scalar multiplication for intensity, addition for composition, subtraction for suppression); and Persona-Flow achieves context-aware adaptation by dynamically composing these vectors during inference. On PersonalityBench, our approach achieves a mean score of 9.60, nearly matching the supervised fine-tuning upper bound of 9.61 without any gradient updates. On our proposed Persona-Evolve benchmark for dynamic personality adaptation, we achieve up to 91% win rates across diverse model families. These results provide evidence that aspects of LLM personality are mathematically tractable, opening new directions for interpretable and efficient behavioral control.
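The three operations named above can be illustrated with a minimal NumPy sketch. This assumes the common mean-difference form of contrastive activation extraction; the helper name, trait labels, and toy data are illustrative and are not the paper's actual Persona-Base implementation.

```python
import numpy as np

def extract_trait_vector(pos_acts, neg_acts):
    """Contrastive extraction (assumed mean-difference variant):
    average activation on trait-positive prompts minus the average
    on trait-negative prompts, normalized to unit length."""
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
d = 64  # toy hidden-state dimension

# Synthetic activations standing in for two traits.
v_warm = extract_trait_vector(rng.normal(size=(32, d)) + 1.0,
                              rng.normal(size=(32, d)))
v_bold = extract_trait_vector(rng.normal(size=(32, d)),
                              rng.normal(size=(32, d)) + 1.0)

h = rng.normal(size=d)             # a hidden state at some layer

h_intense  = h + 1.5 * v_warm      # scalar multiplication: intensity
h_composed = h + v_warm + v_bold   # addition: trait composition
h_muted    = h - 1.5 * v_warm      # subtraction: trait suppression
```

In an actual model the steering term would be added to the residual-stream activation at a chosen layer during the forward pass, rather than to a standalone vector as in this toy.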