Individual neurons participate in the representation of multiple high-level concepts. To what extent can different interpretability methods successfully disentangle these roles? To help address this question, we introduce RAVEL (Resolving Attribute-Value Entanglements in Language Models), a dataset that enables tightly controlled, quantitative comparisons between a variety of existing interpretability methods. We use the resulting conceptual framework to define the new method of Multi-task Distributed Alignment Search (MDAS), which allows us to find distributed representations satisfying multiple causal criteria. With Llama2-7B as the target language model, MDAS achieves state-of-the-art results on RAVEL, demonstrating the importance of going beyond neuron-level analyses to identify features distributed across activations. We release our benchmark at https://github.com/explanare/ravel.