While there has been significant development of models for Plain Language Summarization (PLS), evaluation remains a challenge. PLS lacks a dedicated assessment metric, and the suitability of existing text generation evaluation metrics is unclear because of the unique transformations PLS involves (e.g., adding background explanations, removing specialized terminology). To address these concerns, our study presents a granular meta-evaluation testbed, APPLS, designed to evaluate metrics for PLS. Drawing on previous work, we define a set of perturbations along four criteria that a PLS metric should capture: informativeness, simplification, coherence, and faithfulness. An analysis of metrics using our testbed reveals that current metrics fail to capture simplification consistently. In response, we introduce POMME, a new metric designed to assess text simplification in PLS; the metric is calculated as the normalized perplexity difference between an in-domain and an out-of-domain language model. We demonstrate POMME's correlation with fine-grained variations in simplification and validate its sensitivity across four text simplification datasets. This work contributes the first meta-evaluation testbed for PLS and a comprehensive evaluation of existing metrics. The APPLS testbed and POMME are available at https://github.com/LinguisticAnomalies/APPLS.
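The idea behind POMME, a perplexity difference between an in-domain and an out-of-domain language model, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the add-one-smoothed unigram models stand in for real language models, the function names (`train_unigram`, `log_perplexity`, `perplexity_gap`) are hypothetical, the only normalization applied is per-token averaging of the log-probability, and the sign convention (higher means plainer) is chosen arbitrarily. The abstract does not specify the exact normalization, so this should be read only as a rough picture of the mechanism.

```python
import math
from collections import Counter


def train_unigram(corpus):
    """Return a function giving the add-one-smoothed probability of a token.

    Toy stand-in for a language model (assumption, not the paper's models).
    """
    counts = Counter(tok for sent in corpus for tok in sent.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserves mass for unseen tokens
    return lambda tok: (counts[tok] + 1) / (total + vocab)


def log_perplexity(prob, text):
    """Average negative log-probability per token (log of the perplexity)."""
    toks = text.lower().split()
    if not toks:
        raise ValueError("text must contain at least one token")
    return -sum(math.log(prob(t)) for t in toks) / len(toks)


def perplexity_gap(text, in_domain, out_domain):
    """In-domain minus out-of-domain log-perplexity.

    Under this (assumed) sign convention, plain text that the in-domain
    scientific model finds surprising gets a higher score.
    """
    return log_perplexity(in_domain, text) - log_perplexity(out_domain, text)


# Tiny illustrative corpora (made up for this sketch).
scientific = [
    "myocardial infarction results from coronary artery occlusion",
    "the pathogenesis involves thrombosis and ischemia",
]
everyday = [
    "a heart attack happens when blood flow to the heart is blocked",
    "the heart needs blood to work",
]

in_lm = train_unigram(scientific)
out_lm = train_unigram(everyday)

technical_score = perplexity_gap("coronary artery occlusion", in_lm, out_lm)
plain_score = perplexity_gap("blood flow to the heart is blocked", in_lm, out_lm)
```

In this toy setup the plain sentence receives a higher score than the technical one, which is the kind of sensitivity to simplification the abstract attributes to POMME. A realistic version would replace the unigram models with pretrained in-domain and out-of-domain language models and apply whatever normalization the released code defines.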