APPLS: Evaluating Evaluation Metrics for Plain Language Summarization

While there has been significant development of models for Plain Language Summarization (PLS), evaluation remains a challenge. PLS lacks a dedicated assessment metric, and the suitability of text generation evaluation metrics is unclear due to the unique transformations involved (e.g., adding background explanations, removing specialized terminology). To address these concerns, our study presents a granular meta-evaluation testbed, APPLS, designed to evaluate metrics for PLS. Drawing on previous work, we define a set of perturbations along four criteria that a PLS metric should capture: informativeness, simplification, coherence, and faithfulness. An analysis of metrics using our testbed reveals that current metrics fail to capture simplification consistently. In response, we introduce POMME, a new metric designed to assess text simplification in PLS; the metric is calculated as the normalized perplexity difference between an in-domain and an out-of-domain language model. We demonstrate POMME's correlation with fine-grained variations in simplification and validate its sensitivity across four text simplification datasets. This work contributes the first meta-evaluation testbed for PLS and a comprehensive evaluation of existing metrics. The APPLS testbed and POMME are available at https://github.com/LinguisticAnomalies/APPLS.
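The abstract describes POMME only as a "normalized perplexity difference between an in-domain and out-of-domain language model" and does not give the formula. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that token log-probabilities from the two language models are already available, that "normalized" means a z-score of log-perplexity against a reference corpus, and that the score is the in-domain term minus the out-of-domain term. The function names (`perplexity`, `zscore`, `pomme_like_score`) and the reference lists are hypothetical.

```python
import math
from statistics import mean, pstdev


def perplexity(token_logprobs):
    """Perplexity of a text from its natural-log token probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def zscore(value, reference):
    """Standardize a value against a reference sample (population std)."""
    return (value - mean(reference)) / pstdev(reference)


def pomme_like_score(in_domain_logprobs, out_domain_logprobs,
                     ref_in_ppl, ref_out_ppl):
    """Hypothetical POMME-style score (assumed form, see lead-in).

    in_domain_logprobs / out_domain_logprobs: token log-probs of the same
    text under the in-domain and out-of-domain language models.
    ref_in_ppl / ref_out_ppl: perplexities of a reference corpus under each
    model, used only for normalization (an assumption of this sketch).
    """
    log_in = math.log(perplexity(in_domain_logprobs))
    log_out = math.log(perplexity(out_domain_logprobs))
    z_in = zscore(log_in, [math.log(p) for p in ref_in_ppl])
    z_out = zscore(log_out, [math.log(p) for p in ref_out_ppl])
    # Under this sign convention, text that surprises the in-domain model
    # more than the out-of-domain model receives a higher score.
    return z_in - z_out


# Toy usage with made-up numbers (not results from the paper).
score = pomme_like_score(
    in_domain_logprobs=[math.log(0.1)] * 3,
    out_domain_logprobs=[math.log(0.5)] * 3,
    ref_in_ppl=[5.0, 10.0, 20.0],
    ref_out_ppl=[2.0, 4.0, 8.0],
)
```

Normalizing each model's perplexity separately before taking the difference keeps the two terms on a comparable scale even when the models differ in vocabulary or size. Whether this matches the normalization used in the paper should be checked against the released code at the repository above.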

