Large language models can represent a variety of personas but typically default to a helpful Assistant identity cultivated during post-training. We investigate the structure of the space of model personas by extracting activation directions corresponding to diverse character archetypes. Across several different models, we find that the leading component of this persona space is an "Assistant Axis," which captures the extent to which a model is operating in its default Assistant mode. Steering towards the Assistant direction reinforces helpful and harmless behavior; steering away increases the model's tendency to identify as other entities. Moreover, steering away with more extreme values often induces a mystical, theatrical speaking style. We find this axis is also present in pre-trained models, where it primarily promotes helpful human archetypes like consultants and coaches and inhibits spiritual ones. Measuring deviations along the Assistant Axis predicts "persona drift," a phenomenon where models slip into exhibiting harmful or bizarre behaviors that are uncharacteristic of their typical persona. We find that persona drift is often driven by conversations demanding meta-reflection on the model's processes or featuring emotionally vulnerable users. We show that restricting activations to a fixed region along the Assistant Axis can stabilize model behavior in these scenarios -- and also in the face of adversarial persona-based jailbreaks. Our results suggest that post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona.
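The procedure the abstract describes can be sketched in a minimal form: collect per-persona mean activations, take the leading principal component as the candidate "Assistant Axis," then steer along it or clamp activations to a fixed band on it. This is an illustrative sketch only; the function names, array shapes, and the use of plain PCA over persona means are assumptions, not the paper's exact method.

```python
import numpy as np

def leading_persona_axis(persona_means: np.ndarray) -> np.ndarray:
    """Leading principal component of per-persona mean activations.

    persona_means: (n_personas, d_model) array, one mean activation
    vector per character archetype (hypothetical preprocessing).
    Returns a unit vector in activation space.
    """
    centered = persona_means - persona_means.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def steer(activation: np.ndarray, axis: np.ndarray, alpha: float) -> np.ndarray:
    """Shift an activation along the axis (alpha > 0: toward Assistant)."""
    return activation + alpha * axis

def clamp_to_band(activation: np.ndarray, axis: np.ndarray,
                  lo: float, hi: float) -> np.ndarray:
    """Restrict the activation's projection onto the (unit) axis to [lo, hi]."""
    coeff = activation @ axis
    return activation + (np.clip(coeff, lo, hi) - coeff) * axis

# Illustrative usage with random stand-in activations.
rng = np.random.default_rng(0)
persona_means = rng.normal(size=(8, 16))   # 8 archetypes, 16-dim residual
axis = leading_persona_axis(persona_means)
x = rng.normal(size=16)
x_clamped = clamp_to_band(x, axis, -0.5, 0.5)
```

The clamping step is the "fixed region along the Assistant Axis" idea: components of the activation orthogonal to the axis are untouched, while its coefficient on the axis is clipped into the band.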