Individual neurons participate in the representation of multiple high-level concepts. To what extent can different interpretability methods successfully disentangle these roles? To help address this question, we introduce RAVEL (Resolving Attribute-Value Entanglements in Language Models), a dataset that enables tightly controlled, quantitative comparisons between a variety of existing interpretability methods. We use the resulting conceptual framework to define the new method of Multi-task Distributed Alignment Search (MDAS), which allows us to find distributed representations satisfying multiple causal criteria. With Llama2-7B as the target language model, MDAS achieves state-of-the-art results on RAVEL, demonstrating the importance of going beyond neuron-level analyses to identify features distributed across activations. We release our benchmark at https://github.com/explanare/ravel.