Large language models (LLMs) exhibit distinct and consistent personalities that strongly affect user trust and engagement. Personality frameworks would therefore be valuable tools for characterizing and controlling LLM behavior, yet current approaches remain either costly (post-training) or brittle (prompt engineering). Probing and steering via linear directions has recently emerged as a cheap and efficient alternative. In this paper, we investigate whether linear directions aligned with the Big Five personality traits can be used to probe and steer model behavior. Using Llama 3.3 70B, we generate descriptions of 406 fictional characters together with their Big Five trait scores. We then prompt the model with these descriptions and questions from the Alpaca questionnaire, allowing us to sample hidden activations that vary along personality traits in known, quantifiable ways. Using linear regression, we learn a set of per-layer directions in activation space and test their effectiveness for probing and steering model behavior. Our results suggest that linear directions aligned with trait scores are effective probes for personality detection, while their steering capabilities depend strongly on context, producing reliable effects in forced-choice tasks but limited influence in open-ended generation or when additional context is present in the prompt.