Text summarization, a key natural language generation (NLG) task, is vital across many domains. However, the high cost of inaccurate summaries in risk-critical applications, particularly those involving human-in-the-loop decision-making, raises concerns about the reliability of evaluation methods for uncertainty estimation in text summarization (UE-TS). This concern stems from the dependence of uncertainty-estimation metrics on diverse and potentially conflicting NLG metrics. To address this issue, we introduce a comprehensive UE-TS benchmark incorporating 31 NLG metrics across four dimensions. The benchmark evaluates the uncertainty-estimation capabilities of two large language models and one pre-trained language model on three datasets, incorporating human-annotation analysis where applicable. We also assess the performance of 14 common uncertainty estimation methods within this benchmark. Our findings emphasize the importance of considering multiple uncorrelated NLG metrics and diverse uncertainty estimation methods to ensure reliable and efficient evaluation of UE-TS techniques.
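To make the evaluation setup concrete, the sketch below (our illustration, not the benchmark's released code; all metric names and score arrays are hypothetical stand-ins) shows one common way such a benchmark can be scored: first check how rank-correlated the NLG metrics are with one another, since near-duplicate metrics add little information, then judge a UE method by how strongly its per-summary uncertainty scores anti-correlate with quality under each metric.

```python
# A minimal sketch, assuming per-summary uncertainty scores from one UE
# method and per-summary quality scores from several NLG metrics.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 200  # number of generated summaries (hypothetical)

# Hypothetical uncertainty scores (higher = model is less confident).
uncertainty = rng.random(n)

# Hypothetical quality scores from three NLG metrics (higher = better).
nlg_scores = {
    "rouge_l": rng.random(n),
    "bertscore": rng.random(n),
    "factuality": rng.random(n),
}

# 1) Metric-metric correlation: highly correlated metrics are redundant,
#    so a benchmark should favor a set of mutually uncorrelated ones.
names = list(nlg_scores)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        rho, _ = spearmanr(nlg_scores[a], nlg_scores[b])
        print(f"metric-metric rho({a}, {b}) = {rho:.3f}")

# 2) UE-vs-metric correlation: a good UE method's uncertainty should be
#    negatively rank-correlated with summary quality under each metric.
for name, quality in nlg_scores.items():
    rho, _ = spearmanr(uncertainty, quality)
    print(f"UE-vs-{name} rho = {rho:.3f}  (more negative = better UE)")
```

With random placeholder scores all correlations hover near zero; on real benchmark data, comparing a UE method's correlation profile across several uncorrelated metrics is what reveals whether its apparent performance is robust or an artifact of one particular NLG metric.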