Previous interpretations of language models (LMs) often overlook important distinctions in how these models process factual information. For example, given the query "Astrid Lindgren was born in" with the corresponding completion "Sweden", no distinction is made between whether the prediction was based on exact knowledge of the Swedish author's birthplace or on the assumption that a person with a Swedish-sounding name was born in Sweden. In this paper, we investigate four different prediction scenarios for which the LM can be expected to show distinct behaviors. These scenarios correspond to different levels of model reliability and different types of information being processed, some of which are less desirable for factual predictions. To facilitate precise interpretations of LMs for fact completion, we propose a model-specific recipe called PrISM for constructing datasets with examples of each scenario, based on a set of diagnostic criteria. We apply a popular interpretability method, causal tracing (CT), to the four prediction scenarios and find that while CT produces different results for each scenario, aggregations over a set of mixed examples may represent only the results from the scenario with the strongest measured signal. In summary, we contribute tools for a more granular study of fact completion in language models, along with analyses that provide a more nuanced understanding of how LMs process fact-related queries.