Pretrained language models (PLMs) form the basis of most state-of-the-art NLP technologies. Nevertheless, they are essentially black boxes: humans do not have a clear understanding of what knowledge is encoded in different parts of the models, especially in individual neurons. The situation is different in computer vision, where feature visualization provides a decompositional interpretability technique for neurons of vision models. There, activation maximization is used to synthesize inherently interpretable visual representations of the information encoded in individual neurons. Our work is inspired by this line of research but presents a cautionary tale on the interpretability of single neurons, based on the first large-scale attempt to adapt activation maximization to NLP and, more specifically, to large PLMs. We propose feature textualization, a technique to produce dense representations of neurons in the PLM word embedding space. We apply feature textualization to the BERT model (Devlin et al., 2019) to investigate whether the knowledge encoded in individual neurons can be interpreted and symbolized. We find that the produced representations can provide insights about the knowledge encoded in individual neurons, but that individual neurons do not represent clear-cut symbolic units of language such as words. Additionally, we use feature textualization to investigate how many neurons are needed to encode words in BERT.