An extractive rationale explains a language model's (LM's) prediction on a given task instance by highlighting the text inputs that most influenced the prediction. Ideally, rationale extraction should be faithful (reflective of the LM's actual behavior) and plausible (convincing to humans), without compromising the LM's (i.e., task model's) task performance. Although attribution algorithms and select-predict pipelines are commonly used in rationale extraction, both rely on heuristics that prevent them from satisfying all three desiderata. In light of this, we propose UNIREX, a flexible learning framework that generalizes rationale extractor optimization as follows: (1) specify the architecture for a learned rationale extractor; (2) select explainability objectives (i.e., faithfulness and plausibility criteria); and (3) jointly train the task model and rationale extractor on the task using the selected objectives. UNIREX enables replacing prior works' heuristic design choices with a generic learned rationale extractor in (1) and optimizing it for all three desiderata in (2)-(3). To facilitate comparison between methods with respect to multiple desiderata, we introduce the Normalized Relative Gain (NRG) metric. Across five text classification datasets, our best UNIREX configuration outperforms baselines by an average of 32.9% NRG. Moreover, we find that UNIREX-trained rationale extractors can even generalize to unseen datasets and tasks.
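To make the idea of comparing methods across several desiderata concrete, the following is a minimal sketch of one way such an aggregate score could be computed. The abstract does not define NRG, so everything here is an assumption made for illustration: the min-max normalization across methods, the unweighted mean over metrics, the function names `min_max_normalize` and `normalized_gain`, the metric names, and all numbers (which are invented, not results from the paper).

```python
from statistics import mean


def min_max_normalize(scores, higher_is_better=True):
    """Rescale one metric's raw scores (one per method) to [0, 1], with 1 best.

    Assumption: normalization is min-max across the compared methods.
    """
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All methods tie on this metric; assign a neutral value (assumption).
        return {method: 0.5 for method in scores}
    norm = {method: (s - lo) / (hi - lo) for method, s in scores.items()}
    if not higher_is_better:
        # Flip metrics where lower raw values are better.
        norm = {method: 1.0 - v for method, v in norm.items()}
    return norm


def normalized_gain(raw, higher_is_better):
    """Average the normalized scores over all metrics, per method."""
    per_metric = {
        metric: min_max_normalize(scores, higher_is_better[metric])
        for metric, scores in raw.items()
    }
    methods = next(iter(raw.values())).keys()
    return {m: mean(per_metric[metric][m] for metric in raw) for m in methods}


# Invented scores for three hypothetical methods A, B, C.
raw = {
    "faithfulness": {"A": 20.0, "B": 40.0, "C": 30.0},
    "plausibility": {"A": 50.0, "B": 70.0, "C": 90.0},
    "task_error": {"A": 10.0, "B": 10.0, "C": 30.0},
}
higher_is_better = {"faithfulness": True, "plausibility": True, "task_error": False}

scores = normalized_gain(raw, higher_is_better)
best = max(scores, key=scores.get)
print(best, scores)
```

Putting every metric on a common [0, 1] scale before averaging keeps a metric with a large raw range from dominating the comparison, which is the practical motivation for a normalized aggregate of this kind.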