Analyzing Semantic Faithfulness of Language Models via Input Intervention on Question Answering

Transformer-based language models have been shown to be highly effective for several NLP tasks. In this paper, we consider three transformer models, BERT, RoBERTa, and XLNet, in both small and large versions, and investigate how faithful their representations are with respect to the semantic content of texts. We formalize a notion of semantic faithfulness, in which the semantic content of a text should causally figure in a model's inferences in question answering. We then test this notion by observing a model's behavior when answering questions about a story after performing two novel semantic interventions: deletion intervention and negation intervention. While transformer models achieve high performance on standard question answering tasks, we show that they fail to be semantically faithful in a significant number of cases once we perform these interventions (~50% of cases for deletion intervention, and a ~20% drop in accuracy for negation intervention). We then propose an intervention-based training regime that mitigates the undesirable effects of deletion intervention by a significant margin (from ~50% to ~6%). We analyze the inner workings of the models to better understand the effectiveness of intervention-based training for deletion intervention. However, we show that this training does not attenuate other aspects of semantic unfaithfulness, such as the models' inability to deal with negation intervention or to capture the predicate-argument structure of texts. We also test InstructGPT, via prompting, for its ability to handle the two interventions and to capture predicate-argument structure. While InstructGPT models do achieve very high performance on the predicate-argument structure task, they fail to respond adequately to our deletion and negation interventions.
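The idea behind the deletion intervention can be illustrated with a minimal sketch. It assumes that the intervention removes the sentence supporting the answer from the story, and that a semantically faithful model should then change its answer. The abstract does not spell out these details, so this reading, the helper names (`delete_sentence`, `unchanged_after_deletion`), and the two toy models are hypothetical stand-ins rather than the paper's implementation.

```python
from typing import Callable, List

# Hypothetical model interface: (story text, question) -> answer string.
AnswerFn = Callable[[str, str], str]


def delete_sentence(story: List[str], index: int) -> List[str]:
    """Return a copy of the story without the sentence at `index`."""
    return story[:index] + story[index + 1:]


def unchanged_after_deletion(
    answer_fn: AnswerFn, story: List[str], question: str, support_index: int
) -> bool:
    """True if the answer stays the same after the supporting sentence is removed.

    Under the assumption stated above, an unchanged answer suggests the deleted
    content did not causally figure in the prediction.
    """
    before = answer_fn(" ".join(story), question)
    after = answer_fn(" ".join(delete_sentence(story, support_index)), question)
    return before == after


def constant_model(text: str, question: str) -> str:
    # Toy stand-in that ignores the story and always gives the same answer.
    return "blue"


def lookup_model(text: str, question: str) -> str:
    # Toy stand-in that reads the story and abstains when support is missing.
    for color in ("blue", "red"):
        if f"car was {color}" in text:
            return color
    return "unknown"


story = ["Sam bought a car.", "The car was blue.", "He drove it home."]
question = "What color was the car?"

# The constant model keeps its answer after deletion; the lookup model does not.
constant_unchanged = unchanged_after_deletion(constant_model, story, question, 1)
lookup_unchanged = unchanged_after_deletion(lookup_model, story, question, 1)
```

Under this reading, the fraction of examples for which such a check returns `True` would correspond to the kind of unfaithfulness rate reported for deletion intervention.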
