Previous interpretations of language models (LMs) miss important distinctions in how these models process factual information. For example, given the query "Astrid Lindgren was born in" with the corresponding completion "Sweden", no distinction is drawn between whether the prediction rests on exact knowledge of the Swedish author's birthplace or on the heuristic that a person with a Swedish-sounding name was born in Sweden. In this paper, we investigate four different prediction scenarios for which the LM can be expected to show distinct behaviors. These scenarios correspond to different levels of model reliability and different types of information being processed, some of which are less desirable for factual predictions. To facilitate precise interpretations of LMs for fact completion, we propose a model-specific recipe called PrISM for constructing datasets with examples of each scenario based on a set of diagnostic criteria. We apply a popular interpretability method, causal tracing (CT), to the four prediction scenarios and find that while CT produces different results for each scenario, aggregations over a set of mixed examples may only represent the results from the scenario with the strongest measured signal. In summary, we contribute tools for a more granular study of fact completion in language models and analyses that provide a more nuanced understanding of how LMs process fact-related queries.