Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability. Specifically, most well-performing metrics must be trained on evaluation datasets for specific NLG tasks and evaluation dimensions, which may cause over-fitting to task-specific datasets. Furthermore, existing metrics provide only an evaluation score for each dimension, without revealing the evidence needed to interpret how this score was obtained. To address these challenges, we propose a simple yet effective metric called DecompEval. This metric formulates NLG evaluation as an instruction-style question answering task and uses instruction-tuned pre-trained language models (PLMs) without training on evaluation datasets, aiming to enhance generalization ability. To make the evaluation process more interpretable, we decompose our instruction-style question about the quality of a generated text into subquestions that measure the quality of each sentence. The subquestions, together with the answers generated by PLMs, are then recomposed as evidence to obtain the evaluation result. Experimental results show that DecompEval achieves state-of-the-art performance among untrained metrics for evaluating text summarization and dialogue generation, and that it also exhibits strong dimension-level and task-level generalization ability and interpretability.
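The decompose, answer, and recompose flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt wording, the naive sentence splitter, the 0.5 threshold, and the `yes_prob` interface (a callable that maps a prompt to the probability of answering "yes") are all assumptions made for illustration, and `toy_yes_prob` is a hypothetical stand-in for an instruction-tuned PLM.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interface: maps a prompt to P("yes"). In practice this would
# wrap an instruction-tuned PLM; here it is only a placeholder type.
YesProbFn = Callable[[str], float]


@dataclass
class Evaluation:
    score: float                     # P("yes") for the recomposed question
    evidence: List[Tuple[str, str]]  # (subquestion, "yes"/"no") pairs


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def evaluate(context: str, generated: str, dimension: str,
             yes_prob: YesProbFn) -> Evaluation:
    """Decompose into per-sentence subquestions, then recompose as evidence."""
    header = f"Context: {context}\nGenerated text: {generated}\n"
    evidence: List[Tuple[str, str]] = []
    # Decompose: one yes/no subquestion per sentence of the generated text.
    for i, sentence in enumerate(split_sentences(generated), start=1):
        subq = (f"Is sentence {i} of the generated text {dimension}? "
                f"Sentence: {sentence}")
        answer = "yes" if yes_prob(header + subq) >= 0.5 else "no"
        evidence.append((subq, answer))
    # Recompose: subquestions and answers become evidence for the final question.
    recomposed = header + "".join(f"Q: {q}\nA: {a}\n" for q, a in evidence)
    final_q = f"Q: Is the generated text {dimension}?\nA:"
    return Evaluation(score=yes_prob(recomposed + final_q), evidence=evidence)


def toy_yes_prob(prompt: str) -> float:
    """Toy stand-in for a PLM (assumption): a keyword rule, not a real model."""
    if prompt.endswith("A:"):
        # Final question: fraction of subquestions answered "yes".
        lines = prompt.splitlines()
        yes, no = lines.count("A: yes"), lines.count("A: no")
        return yes / (yes + no) if yes + no else 0.5
    # Subquestion: treat any sentence mentioning "banana" as off-topic.
    sentence = prompt.rsplit("Sentence: ", 1)[-1]
    return 0.1 if "banana" in sentence.lower() else 0.9
```

Calling `evaluate(context, generated, "coherent", toy_yes_prob)` returns both a score and the list of subquestion and answer pairs, so the evidence behind the score can be inspected. Replacing `toy_yes_prob` with a wrapper around a real instruction-tuned model is left open, since the abstract does not specify the model or the prompts.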