Large language models (LLMs) are increasingly deployed as autonomous decision-makers in strategic settings, yet tools for understanding their high-level behavioral traits remain limited. We apply activation steering in game-theoretic settings, constructing persona vectors for altruism, forgiveness, and expectations of others via contrastive activation addition. Evaluating on canonical games, we find that activation steering systematically shifts both quantitative strategic choices and natural-language justifications. However, we also observe that rhetoric and strategy can diverge under steering, and that the vectors for self-behavior and for expectations of others are partially distinct. Our results suggest that persona vectors offer a promising mechanistic handle on high-level traits in strategic environments.
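As a rough illustration of contrastive activation addition, the sketch below builds a persona vector as the difference between mean activations under trait-eliciting and trait-suppressing prompts, then adds a scaled copy of it to a hidden state. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`mean_vector`, `persona_vector`, `steer`), the toy 3-dimensional activations, and the scale `alpha` are all hypothetical, and the abstract does not specify which layer, prompts, or scale are used.

```python
from statistics import fmean


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    return [fmean(column) for column in zip(*rows)]


def persona_vector(pos_acts, neg_acts):
    """Mean activation under trait-eliciting prompts minus mean under trait-suppressing prompts."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(hidden, vector, alpha):
    """Add the persona vector, scaled by alpha, to a single hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations (assumed values) standing in for a real model's hidden states.
altruistic_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
selfish_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

v = persona_vector(altruistic_acts, selfish_acts)
steered = steer([0.5, 0.5, 0.5], v, alpha=2.0)

print(v)        # [2.0, -2.0, 0.0]
print(steered)  # [4.5, -3.5, 0.5]
```

In a real model the same addition would be applied to hidden states at a chosen layer during generation; a negative `alpha` would push behavior away from the trait rather than toward it.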