The representation space of neural models for textual data emerges in an unsupervised manner during training. Understanding how those representations encode human-interpretable concepts is a fundamental problem. One prominent approach to identifying concepts in neural representations is to search for a linear subspace whose erasure prevents the concept from being predicted from the representations. However, while many linear erasure algorithms are tractable and interpretable, neural networks do not necessarily represent concepts linearly. To identify non-linearly encoded concepts, we propose a kernelization of a linear minimax game for concept erasure. We demonstrate that it is possible to prevent specific non-linear adversaries from predicting the concept. However, the protection does not transfer to different non-linear adversaries. Therefore, exhaustively erasing a non-linearly encoded concept remains an open problem.
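As a rough illustration of the distinction drawn above, the following sketch contrasts a linearly encoded concept with a non-linearly encoded one under a simple linear erasure. It is not the minimax game or its kernelization: it assumes a much simpler stand-in, projecting out the direction between the two class means, and all names (`mean_difference_projection`, `X_lin`, `X_rad`, and so on) and the synthetic data are hypothetical choices made for this example only.

```python
import numpy as np


def mean_difference_projection(X, y):
    """Hypothetical helper: projection matrix that removes the single
    direction separating the two class means (a simple stand-in for a
    linear erasure method, not the minimax game from the abstract)."""
    w = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    w = w / np.linalg.norm(w)
    return np.eye(X.shape[1]) - np.outer(w, w)


rng = np.random.default_rng(0)
n = 2000

# Case 1: the concept is encoded linearly (a shift along the first axis).
y_lin = rng.integers(0, 2, size=n)
X_lin = rng.normal(size=(n, 2))
X_lin[:, 0] += 3.0 * y_lin

P_lin = mean_difference_projection(X_lin, y_lin)
X_lin_erased = X_lin @ P_lin  # P_lin is symmetric, so right-multiplying is fine

# After erasure the class means coincide, so the mean gap vanishes.
gap_after = (
    X_lin_erased[y_lin == 1].mean(axis=0)
    - X_lin_erased[y_lin == 0].mean(axis=0)
)

# Case 2: the concept is encoded non-linearly (inside vs. outside a circle).
X_rad = rng.normal(size=(n, 2))
r2 = (X_rad ** 2).sum(axis=1)
y_rad = (r2 > np.median(r2)).astype(int)

P_rad = mean_difference_projection(X_rad, y_rad)
X_rad_erased = X_rad @ P_rad

# A simple non-linear adversary: threshold the squared norm of what is left.
feature = (X_rad_erased ** 2).sum(axis=1)
pred = (feature > np.median(feature)).astype(int)
acc_nonlinear = (pred == y_rad).mean()

print("mean gap after linear erasure:", gap_after)
print("non-linear adversary accuracy after linear erasure:", acc_nonlinear)
```

In this toy setting the linear erasure removes the mean gap in the first case, while in the second case a squared-norm threshold still recovers the concept at clearly above-chance accuracy, which mirrors the observation that linear erasure does not by itself remove non-linearly encoded information.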