Large language models can express values in two main ways: (1) intrinsic expression, reflecting the model's inherent values learned during training, and (2) prompted expression, elicited by explicit prompts. Given the widespread use of these models in value alignment, it is paramount to understand the underlying mechanisms clearly, particularly whether the two forms mostly overlap (as one might expect) or rely on distinct mechanisms; this question remains largely understudied. We analyze it at the mechanistic level using two approaches: (1) value vectors, feature directions extracted from the residual stream that represent value mechanisms, and (2) value neurons, MLP neurons that contribute to value vectors. We demonstrate that intrinsic and prompted value mechanisms partly share common components that are crucial for inducing value expression, generalize across languages, and reconstruct theoretical inter-value correlations in the model's internal representations. Yet, because these mechanisms also possess unique elements that fulfill distinct roles, they lead to different degrees of response diversity (intrinsic > prompted) and value steerability (prompted > intrinsic). In particular, components unique to the intrinsic mechanism promote lexical diversity in responses, whereas those specific to the prompted mechanism strengthen instruction following, taking effect even in distant tasks such as jailbreaking.
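The two probes named above can be sketched in code. This is a minimal, hedged illustration, not the paper's exact method: it assumes a value vector is estimated as a normalized difference of mean residual-stream activations between value-expressing and neutral prompts, and that candidate value neurons are ranked by how strongly their MLP output weights write along that direction. All names, shapes, and the toy data are illustrative assumptions.

```python
import numpy as np

def value_vector(acts_value, acts_neutral):
    """Estimate a value vector as a difference-in-means direction.

    acts_value, acts_neutral: (n_samples, d_model) residual-stream
    activations at one layer, for value-expressing vs. neutral prompts.
    Returns a unit-norm direction in the residual stream.
    """
    v = acts_value.mean(axis=0) - acts_neutral.mean(axis=0)
    return v / np.linalg.norm(v)

def neuron_scores(w_out, v):
    """Score each MLP neuron by how strongly its output weights
    write along the value vector v.

    w_out: (n_neurons, d_model) MLP down-projection weight rows.
    Returns one scalar score per neuron.
    """
    return w_out @ v

# Toy demo: synthetic activations where value prompts are shifted
# along a known ground-truth direction (basis vector 0).
rng = np.random.default_rng(0)
d_model, n_samples, n_neurons = 16, 8, 32
v_true = np.zeros(d_model)
v_true[0] = 1.0
acts_value = rng.normal(size=(n_samples, d_model)) + 5.0 * v_true
acts_neutral = rng.normal(size=(n_samples, d_model))

v = value_vector(acts_value, acts_neutral)
w_out = rng.normal(size=(n_neurons, d_model))
scores = neuron_scores(w_out, v)
top = np.argsort(-np.abs(scores))[:5]  # candidate "value neurons"
```

In this toy setup, the recovered direction concentrates on the planted coordinate, and the highest-|score| neurons are those whose output weights most align with it; with real model activations, the same shapes would come from hooked forward passes rather than random data.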