As language models continue to scale in size and capability, they display an array of emergent behaviors, both beneficial and concerning, which heightens the need to control model behavior. We aim to control the personality traits of language models at inference time, giving them varied character features that can meet the requirements of different types of tasks. Personality is a higher-level, more abstract representation of a language model's behavior. We introduce ControlLM, which leverages differential activation patterns, derived from contrasting behavioral prompts in the model's latent space, to influence the model's personality traits at inference. This approach allows for precise, real-time adjustment of model behavior. First, we demonstrate ControlLM's capacity to elicit diverse persona behaviors without any training, while precise control allows personality traits to closely match average human values. Subsequently, we showcase improved reasoning and question answering through selective amplification of beneficial attributes such as conscientiousness and friendliness. We hope that this work will inspire research on controlling human-like behaviors of language models and provide insights for future work. Our code is publicly available at https://github.com/wengsyx/ControlLM.
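The idea of steering with differential activation patterns can be illustrated with a minimal sketch. This is not the released ControlLM implementation: it assumes the control direction is the mean difference between activations from contrasting prompts and that it is added to hidden states with a scalar strength. The names `control_vector`, `steer`, and `alpha`, and the toy NumPy arrays standing in for real layer activations, are all hypothetical.

```python
import numpy as np

def control_vector(pos_acts, neg_acts):
    """Mean activation difference between contrasting prompt sets (assumed form)."""
    pos = np.asarray(pos_acts, dtype=float)
    neg = np.asarray(neg_acts, dtype=float)
    return pos.mean(axis=0) - neg.mean(axis=0)

def steer(hidden, vector, alpha):
    """Shift hidden states along the control vector; alpha sets the strength."""
    return np.asarray(hidden, dtype=float) + alpha * vector

# Toy stand-ins for layer activations (hypothetical values, hidden size 4).
rng = np.random.default_rng(0)
trait_direction = np.array([1.0, 0.0, -1.0, 0.0])
base = rng.normal(size=(8, 4))
pos_acts = base + 0.5 * trait_direction  # e.g. prompts expressing the trait
neg_acts = base - 0.5 * trait_direction  # e.g. prompts expressing its opposite

vec = control_vector(pos_acts, neg_acts)

# At inference, the same vector is added to new hidden states.
hidden = rng.normal(size=(3, 4))
steered = steer(hidden, vec, alpha=2.0)
```

In this toy setup the shared `base` component cancels in the difference, so `vec` recovers `trait_direction`. With a real model, the activations would be read from a chosen layer, and a negative `alpha` would suppress the trait rather than amplify it.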