We study feature interactions in the context of feature attribution methods for post-hoc interpretability. In interpretability research, getting to grips with feature interactions is increasingly recognised as an important challenge, because interacting features are key to the success of neural networks. Feature interactions allow a model to build up hierarchical representations of its input, and might provide an ideal starting point for investigating linguistic structure in language models. However, uncovering the exact role that these interactions play is difficult, and a diverse range of interaction attribution methods has been proposed. In this paper, we focus on the question of which of these methods most faithfully reflects the inner workings of the target models. We develop a grey-box methodology in which we train models to perfection on a formal language classification task, using probabilistic context-free grammars (PCFGs). We show that, under specific configurations, some methods are indeed able to uncover the grammatical rules acquired by a model. Based on these findings, we extend our evaluation to a case study on language models, providing novel insights into the linguistic structure that these models have acquired.