Generating free-text rationales is a promising step towards explainable NLP, yet evaluating such rationales remains a challenge. Existing metrics have mostly focused on measuring the association between the rationale and a given label. We argue that an ideal metric should focus on the new information uniquely provided in the rationale that is otherwise not provided in the input or the label. We investigate this research problem from an information-theoretic perspective using conditional V-information (Hewitt et al., 2021). More concretely, we propose a metric called REV (Rationale Evaluation with conditional V-information), to quantify the amount of new, label-relevant information in a rationale beyond the information already available in the input or the label. Experiments across four benchmarks with reasoning tasks, including chain-of-thought, demonstrate the effectiveness of REV in evaluating rationale-label pairs, compared to existing metrics. We further demonstrate REV is consistent with human judgments on rationale evaluations and provides more sensitive measurements of new information in free-text rationales. When used alongside traditional performance metrics, REV provides deeper insights into models' reasoning and prediction processes.
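As a rough illustration of the idea behind REV, conditional V-information can be read as the gain in log-probability an evaluator assigns to the label when it sees the rationale, relative to a baseline that carries only the information already present in the input or the label. The sketch below is a minimal example under that reading, not the authors' implementation: the function names `pointwise_rev` and `corpus_rev` are hypothetical, and the probabilities are assumed to come from two separately trained evaluator models (one given the rationale, one given only the baseline).

```python
import math
from typing import Sequence


def pointwise_rev(p_with_rationale: float, p_baseline: float) -> float:
    """Log-probability gain (in nats) for the gold label.

    p_with_rationale: probability of the gold label from an evaluator
        that sees the rationale (assumed input, hypothetical name).
    p_baseline: probability of the gold label from an evaluator that
        sees only the baseline information (assumed input).
    """
    if not (0.0 < p_with_rationale <= 1.0 and 0.0 < p_baseline <= 1.0):
        raise ValueError("probabilities must lie in (0, 1]")
    return math.log(p_with_rationale) - math.log(p_baseline)


def corpus_rev(p_with: Sequence[float], p_base: Sequence[float]) -> float:
    """Average the pointwise scores over a set of rationale-label pairs."""
    if len(p_with) != len(p_base) or len(p_with) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    scores = [pointwise_rev(a, b) for a, b in zip(p_with, p_base)]
    return sum(scores) / len(scores)
```

Under this sketch, a positive score means the rationale makes the label easier to predict than the baseline alone, a score near zero means it adds nothing new, and a negative score means it points away from the label.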