While models for Plain Language Summarization (PLS) have developed significantly, evaluation remains a challenge. This is in part because PLS involves multiple, interrelated language transformations (e.g., adding background explanations, removing specialized terminology). No metrics have been designed explicitly for PLS, and the suitability of other text generation evaluation metrics remains unclear. To address these concerns, our study presents a granular meta-evaluation testbed, APPLS, designed to evaluate existing metrics for PLS. Drawing on insights from previous research, we define controlled perturbations for our testbed along four criteria that a metric of plain language should capture: informativeness, simplification, coherence, and faithfulness. Our analysis of metrics using this testbed reveals that current metrics fail to capture simplification, signaling a crucial gap. In response, we introduce POMME, a novel metric designed to assess text simplification in PLS. We demonstrate that it correlates with simplification perturbations and validate it across a variety of datasets. Our research contributes the first meta-evaluation testbed for PLS and a comprehensive evaluation of existing metrics, offering insights relevant to other text generation tasks.