Language models (LMs) often must integrate facts they memorized in pretraining with new information that appears in a given context. These two sources can disagree, causing competition within the model, and it is unclear how an LM will resolve the conflict. On a dataset that queries for knowledge of world capitals, we investigate both distributional and mechanistic determinants of LM behavior in such situations. Specifically, we measure how often an LM uses a counterfactual prefix (e.g., "The capital of Poland is London") to overwrite what it learned in pretraining ("Warsaw"). On Pythia and GPT2, the training frequencies of both the query country ("Poland") and the in-context city ("London") strongly affect the models' likelihood of using the counterfactual. We then use head attribution to identify individual attention heads that promote either the memorized answer or the in-context answer in the logits. By scaling the value vectors of these heads up or down, we can control the likelihood of using the in-context answer on new data. Simply scaling a single head at runtime can raise the rate of generating the in-context answer to 88%. Our work contributes to a body of evidence showing that model behaviors can often be localized to specific components, and it provides a proof of concept for how future methods might control model behavior dynamically at runtime.
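To make the head-scaling intervention concrete, the following is a minimal sketch on a toy attention layer. It is not the authors' code. The function name `attention_with_head_scaling`, the `head_scales` argument, the random weights, and the bias-free projections are all assumptions made for illustration. It shows only the core operation: multiplying one head's value vectors by a scalar before the heads are mixed by the output projection.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_with_head_scaling(x, w_q, w_k, w_v, w_o, n_heads, head_scales=None):
    """Causal multi-head self-attention with per-head value scaling.

    x: (seq, d_model); w_q, w_k, w_v, w_o: (d_model, d_model), no biases.
    head_scales: hypothetical dict {head_index: scalar}; other heads use 1.0.
    """
    seq, d_model = x.shape
    d_head = d_model // n_heads

    # Default scale of 1.0 leaves a head untouched.
    scales = np.ones(n_heads)
    for head, scale in (head_scales or {}).items():
        scales[head] = scale

    def split(t):
        # (seq, d_model) -> (n_heads, seq, d_head)
        return t.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)

    # The intervention: scale each head's value vectors by its scalar.
    v = v * scales[:, None, None]

    # Standard causal attention pattern; it is unaffected by the scaling.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    pattern = softmax(scores, axis=-1)

    heads = pattern @ v  # (n_heads, seq, d_head)
    merged = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return merged @ w_o


# Usage: amplify head 1 by a factor of 3 on random toy inputs.
rng = np.random.default_rng(0)
seq, d_model, n_heads = 5, 8, 4
x = rng.normal(size=(seq, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

baseline = attention_with_head_scaling(x, w_q, w_k, w_v, w_o, n_heads)
amplified = attention_with_head_scaling(
    x, w_q, w_k, w_v, w_o, n_heads, head_scales={1: 3.0}
)
```

Because the layer output is linear in each head's value vectors, scaling a head by a factor alpha changes the output by (alpha - 1) times that head's own contribution, leaving the attention pattern and the other heads unchanged. In a trained model the same scaling would be applied inside the chosen attention layer at inference time, with the head and the scalar chosen from the attribution results rather than at random as here.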