Pretrained language models (LMs) encode implicit representations of knowledge in their parameters. However, localizing these representations and disentangling them from each other remains an open problem. In this work, we investigate whether pretrained language models contain various knowledge-critical subnetworks: particular sparse computational subgraphs whose removal precisely suppresses specific knowledge the model has memorized. We propose a multi-objective differentiable masking scheme that can be applied to both weights and neurons to discover such subnetworks, and we show that we can use them to precisely remove specific knowledge from models while minimizing adverse effects on the behavior of the original model. We demonstrate our method on multiple GPT-2 variants, uncovering highly sparse subnetworks (98%+ sparsity) that are critical for expressing specific collections of relational knowledge. When these subnetworks are removed, the remaining network maintains most of its initial abilities but struggles to represent the suppressed knowledge.
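The abstract does not spell out the masking scheme, so the following is a minimal PyTorch sketch of one common formulation of differentiable weight masking: a per-weight binary mask trained with a straight-through sigmoid estimator, combined into a multi-objective loss. The class `MaskedLinear`, the helper `masking_objective`, the logit initialization, and the loss coefficients are all illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLinear(nn.Module):
    """Linear layer with frozen pretrained weights and a learnable binary
    mask that marks a candidate knowledge-critical subnetwork."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        # Freeze the pretrained parameters; only the mask logits are trained.
        self.register_buffer("weight", linear.weight.detach().clone())
        if linear.bias is not None:
            self.register_buffer("bias", linear.bias.detach().clone())
        else:
            self.bias = None
        # One logit per weight, initialized so almost no weight is selected.
        self.mask_logits = nn.Parameter(torch.full_like(self.weight, -3.0))

    def subnetwork_mask(self) -> torch.Tensor:
        probs = torch.sigmoid(self.mask_logits)
        hard = (probs > 0.5).float()
        # Straight-through estimator: hard 0/1 mask in the forward pass,
        # sigmoid gradients in the backward pass.
        return hard + probs - probs.detach()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Evaluate the model WITH the candidate subnetwork removed (zeroed).
        kept = self.weight * (1.0 - self.subnetwork_mask())
        return F.linear(x, kept, self.bias)


def masking_objective(loss_target: torch.Tensor,
                      loss_control: torch.Tensor,
                      masked_layers,
                      lam_sparse: float = 1e-4) -> torch.Tensor:
    """Illustrative multi-objective loss (coefficients are assumptions):
    with the subnetwork removed, maximize LM loss on the target facts
    (suppression), minimize LM loss on unrelated data (maintenance), and
    keep the selected subnetwork tiny (sparsity)."""
    sparsity = torch.stack(
        [m.subnetwork_mask().mean() for m in masked_layers]).mean()
    return -loss_target + loss_control + lam_sparse * sparsity
```

In a full pipeline one would wrap each linear layer of a GPT-2 variant in `MaskedLinear`, compute `loss_target` on prompts expressing the target relational knowledge and `loss_control` on held-out text, optimize only the mask logits, and finally zero out the masked weights to remove the discovered subnetwork permanently.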