Factuality is important to dialogue summarization. Factual error correction (FEC) of model-generated summaries is one way to improve factuality. Current FEC evaluation, which relies on factuality metrics, is neither reliable nor detailed enough. To address this problem, we are the first to manually annotate an FEC dataset for dialogue summarization, containing 4000 items, and we propose FERRANTI, a fine-grained evaluation framework based on reference correction that automatically evaluates the performance of FEC models on different error categories. Using this evaluation framework, we conduct extensive experiments with FEC approaches under a variety of settings, identify the best training modes, and find significant differences in the performance of existing approaches across factual error categories.