The ability of neural networks to represent more features than they have neurons makes interpreting them challenging. This phenomenon, known as superposition, has spurred efforts to find architectures that are more interpretable than standard multilayer perceptrons (MLPs) with elementwise activation functions. In this note, I examine bilinear layers, a type of MLP layer that is mathematically much easier to analyze while also performing better than standard MLPs. Although they are nonlinear functions of their input, I demonstrate that bilinear layers can be expressed using only linear operations and third-order tensors. This expression can be integrated into a mathematical framework for transformer circuits, which was previously limited to attention-only transformers. These results suggest that bilinear layers are easier to analyze mathematically than current architectures and thus may lend themselves to deeper safety insights by allowing us to talk more formally about circuits in neural networks. Additionally, bilinear layers may offer an alternative path for mechanistic interpretability: understanding the mechanisms of feature construction instead of enumerating a (potentially exponentially) large number of features in large models.
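
To make the claim about linear operations and third-order tensors concrete, here is a minimal NumPy sketch. It assumes the common bias-free form of a bilinear layer, g(x) = (W1 x) ⊙ (W2 x), which this abstract does not spell out. The names `W1`, `W2`, `B` and the layer sizes are illustrative choices, not notation taken from the note.

```python
import numpy as np

def bilinear_layer(W1, W2, x):
    """Assumed bias-free bilinear layer: elementwise product of two linear maps."""
    return (W1 @ x) * (W2 @ x)

def bilinear_tensor(W1, W2):
    """Build the third-order tensor B with B[i, j, k] = W1[i, j] * W2[i, k]."""
    return np.einsum("ij,ik->ijk", W1, W2)

def apply_tensor(B, x):
    """Contract B with x along both input axes: out[i] = sum_jk B[i,j,k] x[j] x[k]."""
    return np.einsum("ijk,j,k->i", B, x, x)

# Illustrative sizes and random weights (hypothetical, for demonstration only).
rng = np.random.default_rng(0)
d_in, d_out = 4, 3
W1 = rng.normal(size=(d_out, d_in))
W2 = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)

B = bilinear_tensor(W1, W2)
direct = bilinear_layer(W1, W2, x)
via_tensor = apply_tensor(B, x)

# The two computations agree up to floating-point error.
print(np.allclose(direct, via_tensor))
```

Under this assumption, all of the layer's nonlinearity comes from contracting the same input twice against a fixed tensor, so the weights can be studied with ordinary linear-algebraic tools.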