APPLS: Evaluating Evaluation Metrics for Plain Language Summarization

While there has been significant development of models for Plain Language Summarization (PLS), evaluation remains a challenge. PLS lacks a dedicated assessment metric, and the suitability of text generation evaluation metrics is unclear due to the unique transformations involved (e.g., adding background explanations, removing jargon). To address these questions, our study introduces a granular meta-evaluation testbed, APPLS, designed to evaluate metrics for PLS. We identify four PLS criteria from previous work -- informativeness, simplification, coherence, and faithfulness -- and define a set of perturbations corresponding to these criteria that sensitive metrics should be able to detect. We apply these perturbations to extractive hypotheses for two PLS datasets to form our testbed. Using APPLS, we assess the performance of 14 metrics, including automated scores, lexical features, and LLM prompt-based evaluations. Our analysis reveals that while some current metrics show sensitivity to specific criteria, no single method captures all four criteria simultaneously. We therefore recommend that a suite of automated metrics be used to capture PLS quality along all relevant criteria. This work contributes the first meta-evaluation testbed for PLS and a comprehensive evaluation of existing metrics. APPLS and our evaluation code are available at https://github.com/LinguisticAnomalies/APPLS.
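The perturbation-based idea described in the abstract can be illustrated with a minimal sketch. It is not the APPLS implementation: the metric (`unigram_f1`, a toy ROUGE-1-like overlap score), the two perturbation functions, and the example sentences are all hypothetical stand-ins chosen for illustration. The sketch perturbs a hypothesis along two of the four criteria (informativeness via sentence deletion, coherence via sentence reordering) and checks whether the metric's score changes.

```python
import random
from collections import Counter


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Toy lexical-overlap score (ROUGE-1-like F1); a stand-in, not an APPLS metric."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def delete_sentences(sentences, fraction, rng):
    """Informativeness-style perturbation (assumed): drop a fraction of sentences."""
    keep = max(1, round(len(sentences) * (1 - fraction)))
    kept_idx = sorted(rng.sample(range(len(sentences)), keep))
    return [sentences[i] for i in kept_idx]


def shuffle_sentences(sentences, rng):
    """Coherence-style perturbation (assumed): reorder sentences."""
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical example text; the hypothesis starts out identical to the reference.
reference_sentences = [
    "The trial tested a new drug for high blood pressure.",
    "Patients who took the drug had lower blood pressure after six weeks.",
    "Side effects were mild and rare.",
    "More studies are needed to confirm the results.",
]
reference = " ".join(reference_sentences)

rng = random.Random(0)  # fixed seed so the perturbations are reproducible
variants = {
    "original": reference_sentences,
    "deleted": delete_sentences(reference_sentences, 0.5, rng),
    "shuffled": shuffle_sentences(reference_sentences, rng),
}

# Score each variant against the unperturbed reference.
scores = {name: unigram_f1(" ".join(sents), reference) for name, sents in variants.items()}

for name, score in scores.items():
    # A sensitive metric should score a perturbed variant below the original.
    print(f"{name:9s} score={score:.3f} drop={scores['original'] - score:.3f}")
```

In this toy setting the overlap score drops when sentences are deleted but is unchanged by reordering, since a bag-of-words score ignores sentence order. That mirrors the abstract's conclusion in miniature: a metric can be sensitive to one criterion and blind to another, which is the motivation for using a suite of metrics.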

