The representation space of neural models for textual data emerges in an unsupervised manner during training. Understanding how those representations encode human-interpretable concepts is a fundamental problem. One prominent approach to identifying concepts in neural representations is to search for a linear subspace whose erasure prevents the concept from being predicted from the representations. However, while many linear erasure algorithms are tractable and interpretable, neural networks do not necessarily represent concepts in a linear manner. To identify non-linearly encoded concepts, we propose a kernelization of a linear minimax game for concept erasure. We demonstrate that it is possible to prevent specific non-linear adversaries from predicting the concept. However, the protection does not transfer to different non-linear adversaries. Therefore, exhaustively erasing a non-linearly encoded concept remains an open problem.
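To make the idea of erasing a linear subspace concrete, the following is a minimal sketch, not the minimax game proposed here. It assumes synthetic data and a deliberately simplified eraser that projects out only the class-mean-difference direction; the function name `erase_direction` and all data parameters are hypothetical. Removing this one direction closes the gap between class means, but it does not by itself guarantee that every linear adversary fails, and it offers no protection against non-linear ones.

```python
import numpy as np

def erase_direction(X, y):
    """Project out the class-mean-difference direction.

    A simplified, hypothetical linear eraser for illustration only.
    """
    diff = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    w = diff / np.linalg.norm(diff)            # unit vector of the concept direction
    P = np.eye(X.shape[1]) - np.outer(w, w)    # orthogonal projection onto its complement
    return X @ P, P                            # P is symmetric, so X @ P applies P to each row

# Synthetic representations: a binary concept encoded linearly along the first axis.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
X = rng.normal(size=(500, 4))
X[:, 0] += 3.0 * (2 * y - 1)

X_erased, P = erase_direction(X, y)

# Distance between the two class means before and after erasure.
gap_before = np.linalg.norm(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
gap_after = np.linalg.norm(X_erased[y == 1].mean(axis=0) - X_erased[y == 0].mean(axis=0))
print(f"class-mean gap before: {gap_before:.3f}, after: {gap_after:.2e}")
```

After the projection, the class means coincide up to floating-point error, so an adversary that relies on the mean difference can no longer separate the classes. A concept encoded non-linearly, for example through the norm of the representation, would pass through such a projection largely intact, which is the case the kernelized approach targets.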