Humans shift between different personas depending on social context. Large Language Models (LLMs) demonstrate a similar flexibility in adopting different personas and behaviors. Existing approaches, however, typically adapt such behavior through external knowledge such as prompting, retrieval-augmented generation (RAG), or fine-tuning. We ask: do LLMs really need external context or parameters to adapt to different behaviors, or do they already have such knowledge embedded in their parameters? In this work, we show that LLMs already contain persona-specialized subnetworks in their parameter space. Using small calibration datasets, we identify distinct activation signatures associated with different personas. Guided by these statistics, we develop a masking strategy that isolates lightweight persona subnetworks. Building on these findings, we further ask: how can we discover opposing subnetworks within the model that give rise to binary-opposed personas, such as introvert versus extrovert? To further enhance separation in such binary-opposition scenarios, we introduce a contrastive pruning strategy that identifies the parameters responsible for the statistical divergence between opposing personas. Our method is entirely training-free and relies solely on the language model's existing parameter space. Across diverse evaluation settings, the resulting subnetworks exhibit significantly stronger persona alignment than baselines that require external knowledge, while being more efficient. Our findings suggest that diverse human-like behaviors are not merely induced in LLMs but are already embedded in their parameter space, pointing toward a new perspective on controllable and interpretable personalization in large language models.
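The pipeline sketched in the abstract (calibration-set activation statistics, a mask that keeps the most persona-active units, and a contrastive variant that keeps the most divergent units between opposing personas) can be illustrated on synthetic activations. This is a minimal sketch under assumptions, not the paper's implementation: the function names, the top-k thresholding rule, and the synthetic data are all illustrative choices.

```python
import numpy as np

def activation_stats(acts):
    # Mean absolute activation per unit over a small calibration set
    # (acts: [num_calibration_examples, num_units]).
    return np.abs(acts).mean(axis=0)

def persona_mask(stats, keep_ratio=0.1):
    # Keep the top-k most active units as the persona subnetwork
    # (illustrative thresholding rule, not the paper's criterion).
    k = max(1, int(keep_ratio * stats.size))
    thresh = np.sort(stats)[-k]
    return stats >= thresh

def contrastive_mask(stats_a, stats_b, keep_ratio=0.1):
    # Contrastive variant: keep the units whose statistics diverge most
    # between opposing personas (e.g. introvert vs. extrovert).
    divergence = np.abs(stats_a - stats_b)
    k = max(1, int(keep_ratio * divergence.size))
    thresh = np.sort(divergence)[-k]
    return divergence >= thresh

# Synthetic calibration activations: 32 examples x 100 units, with
# distinct unit groups artificially specialized per persona.
rng = np.random.default_rng(0)
acts_a = rng.normal(0.0, 1.0, size=(32, 100))
acts_a[:, :10] += 3.0          # units specialized for persona A
acts_b = rng.normal(0.0, 1.0, size=(32, 100))
acts_b[:, 10:20] += 3.0        # units specialized for persona B

mask_a = persona_mask(activation_stats(acts_a))
mask_ab = contrastive_mask(activation_stats(acts_a), activation_stats(acts_b))
print(mask_a.sum(), mask_ab.sum())  # number of units each mask keeps
```

Applied to a real model, the boolean masks would zero out (prune) all unselected parameters, leaving a lightweight subnetwork; no gradient updates are involved, matching the training-free claim.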