Characterizing the implicit structure of the computation within neural networks is a foundational problem in the area of deep learning interpretability. Can the inner decision process of neural networks be captured symbolically in some familiar logic? We show that any fixed-precision transformer neural network can be translated into an equivalent fixed-size $\mathsf{FO}(\mathsf{M})$ formula, i.e., a first-order logic formula that, in addition to standard universal and existential quantifiers, may also contain majority-vote quantifiers. The proof idea is to design highly uniform boolean threshold circuits that can simulate transformers, and then leverage known theoretical connections between circuits and logic. Our results reveal a surprisingly simple formalism for capturing the behavior of transformers, show that simple problems like integer division are "transformer-hard", and provide valuable insights for comparing transformers to other models like RNNs. Our results suggest that first-order logic with majority may be a useful language for expressing programs extracted from transformers.
翻译:刻画神经网络中计算过程的隐含结构是深度学习可解释性领域的基础性问题。神经网络内部的决策过程能否被某种熟悉逻辑符号化地捕捉?我们证明,任意固定精度的Transformer神经网络均可转化为等价的固定大小$\mathsf{FO}(\mathsf{M})$公式,即一种除了标准全称量词和存在量词外,还可包含多数投票量词的一阶逻辑公式。证明思路是设计能够模拟Transformer的高度均匀布尔阈值电路,并利用电路与逻辑之间已知的理论联系。我们的结果揭示了捕捉Transformer行为的惊人简洁形式,表明整数除法等简单问题是"Transformer难解的",并为比较Transformer与循环神经网络等其他模型提供了宝贵见解。该结论表明,带有多数量的第一阶逻辑可能是表达从Transformer中提取程序的有用语言。