We investigate an approach for extracting knowledge from trained neural networks based on Angluin's exact learning model with membership and equivalence queries to an oracle. In this approach, the oracle is a trained neural network. We consider Angluin's classical algorithm for learning Horn theories and study the changes needed to apply it to learning from neural networks. In particular, we have to consider that trained neural networks may not behave as Horn oracles, meaning that their underlying target theory may not be Horn. We propose a new algorithm that aims at extracting the "tightest Horn approximation" of the target theory and that is guaranteed to terminate in exponential time in the worst case, and in polynomial time if the target has polynomially many non-Horn examples. To showcase the applicability of the approach, we perform experiments on pre-trained language models and extract rules that expose occupation-based gender biases.
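To make the notion of a non-Horn target concrete, the sketch below uses the standard characterization that a propositional theory is expressible in Horn form exactly when its set of models is closed under intersection. It is an illustration only, not the algorithm proposed here: the names `is_horn_closed` and `horn_envelope` are hypothetical, models are represented as frozensets of the variables set to true, and reading the "tightest Horn approximation" as the intersection closure of the models is an assumption.

```python
from itertools import combinations


def is_horn_closed(models):
    """Return True if the model set is closed under pairwise intersection."""
    models = set(models)
    return all((a & b) in models for a, b in combinations(models, 2))


def horn_envelope(models):
    """Return the smallest intersection-closed superset of the model set.

    Assumption: this closure is taken here as the set of models of the
    tightest Horn approximation of the theory.
    """
    closed = set(models)
    frontier = list(closed)
    while frontier:
        m = frontier.pop()
        # Intersect the popped model with everything known so far.
        for other in list(closed):
            meet = m & other
            if meet not in closed:
                closed.add(meet)
                frontier.append(meet)
    return closed


# The theory "a or b" over variables {a, b} is not Horn:
# {a} and {b} are models, but their intersection (all false) is not.
models = {frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})}
print(is_horn_closed(models))                  # False
print(is_horn_closed(horn_envelope(models)))   # True
```

In this toy case the closure adds the all-false assignment, so the Horn approximation of "a or b" accepts every assignment. The enumeration is exponential in the number of variables in general and is meant only for small examples.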