\emph{Circuit analysis} is a promising technique for understanding the internal mechanisms of language models. However, existing analyses have been carried out on small models far from the state of the art. To address this, we present a case study of circuit analysis in the 70B Chinchilla model, aiming to test the scalability of circuit analysis. In particular, we study multiple-choice question answering, and investigate Chinchilla's capability to identify the correct answer \emph{label} given knowledge of the correct answer \emph{text}. We find that the existing techniques of logit attribution, attention pattern visualization, and activation patching naturally scale to Chinchilla, allowing us to identify and categorize a small set of `output nodes' (attention heads and MLPs). We further study the `correct letter' category of attention heads, aiming to understand the semantics of their features, with mixed results. For normal multiple-choice question answering, we significantly compress the query, key and value subspaces of the head without loss of performance when operating on the answer labels, and we show that the query and key subspaces represent an `Nth item in an enumeration' feature to at least some extent. However, when we attempt to use this explanation to understand the heads' behaviour on a more general distribution including randomized answer labels, we find that it is only a partial explanation, suggesting there is more to learn about the operation of `correct letter' heads on multiple-choice question answering.