When solving challenging problems, language models (LMs) can identify relevant information in long and complicated contexts. To study how LMs solve retrieval tasks in diverse situations, we introduce ORION, a collection of structured retrieval tasks spanning six domains, from text understanding to coding. Each task in ORION can be represented abstractly by a request (e.g., a question) that retrieves an attribute (e.g., the character name) from a context (e.g., a story). We apply causal analysis to 18 open-source language models ranging from 125 million to 70 billion parameters. We find that LMs internally decompose retrieval tasks in a modular way: middle layers at the last token position process the request, while late layers retrieve the correct entity from the context. After causally enforcing this decomposition, models are still able to solve the original task, preserving 70% of the original correct-token probability in 98 of the 106 studied model-task pairs. We connect this macroscopic decomposition with a microscopic description through a fine-grained case study of a question-answering task on Pythia-2.8b. Building on this high-level understanding, we demonstrate a proof-of-concept application for scalable internal oversight of LMs that mitigates prompt injection while requiring human supervision on only a single input. Our solution improves accuracy drastically (from 15.5% to 97.5% on Pythia-12b). This work presents evidence of a universal emergent modular processing of tasks across varied domains and models and is a pioneering effort in applying interpretability for scalable internal oversight of LMs.
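To make the task abstraction and the retention figure concrete, the following is a minimal sketch, not the authors' implementation. The class name `RetrievalTask`, the helper `retained_fraction`, the prompt layout, and the example probabilities are all hypothetical and chosen only for illustration. The sketch assumes that "preserving 70% of the original correct-token probability" means the ratio of the correct-token probability after the intervention to the probability before it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalTask:
    """Hypothetical container for one ORION-style task instance."""
    context: str    # e.g. a story
    request: str    # e.g. a question about the story
    attribute: str  # the expected answer, e.g. the character name

    def prompt(self) -> str:
        # Assumed layout: context first, then the request.
        return f"{self.context}\n{self.request}"


def retained_fraction(p_original: float, p_patched: float) -> float:
    """Share of the original correct-token probability kept after an intervention."""
    if p_original <= 0:
        raise ValueError("p_original must be positive")
    return p_patched / p_original


task = RetrievalTask(
    context="Alice walked to the market. Bob stayed home.",
    request="Question: Who walked to the market? Answer:",
    attribute=" Alice",
)

# Made-up probabilities for illustration only; they are not results from the paper.
frac = retained_fraction(p_original=0.80, p_patched=0.60)
meets_threshold = frac >= 0.70  # the 70% level mentioned in the abstract
```

Under this reading, a model-task pair would count toward the 98 of 106 when `meets_threshold` holds for the probabilities measured on that pair.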