Many evaluation metrics exist for natural language generation tasks, but they have limited utility for story generation: they generally correlate poorly with human judgments, and because they are designed to assess overall generation quality, they do not measure fine-grained story aspects (e.g., fluency versus relatedness). In this paper, we propose DeltaScore, an approach that uses perturbation to evaluate fine-grained story aspects. Our core idea rests on the hypothesis that the better a story performs on a specific aspect (e.g., fluency), the more it will be affected by a corresponding perturbation (e.g., introducing typos). To measure the impact, we use a language model to calculate the difference in likelihood between the pre- and post-perturbation stories. We evaluate DeltaScore against state-of-the-art model-based and traditional similarity-based metrics across multiple story domains, and investigate its correlation with human judgments on five fine-grained story aspects: fluency, coherence, relatedness, logicality, and interestingness. Our results show that DeltaScore performs strongly in evaluating fine-grained story aspects and, strikingly, that one specific perturbation appears to be highly effective in measuring most aspects.
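To make the likelihood-difference idea concrete, the following is a minimal sketch rather than the method as implemented in the paper. It assumes a toy add-alpha smoothed word-bigram model in place of the language model, a hypothetical typo perturbation (`add_typos`) that swaps two letters in longer words, and a sign convention in which the score is the log-likelihood of the original story minus that of the perturbed story. The names `train_bigram_lm`, `add_typos`, and `delta_score` are illustrative and do not come from the source.

```python
import math
from collections import Counter


def train_bigram_lm(corpus, alpha=1.0):
    """Fit an add-alpha smoothed word-bigram model and return a log-likelihood scorer.

    This toy model is an assumption; it only stands in for a real language model.
    """
    sentences = [["<s>"] + line.lower().split() for line in corpus]
    vocab = {tok for sent in sentences for tok in sent} | {"<unk>"}
    context_counts, bigram_counts = Counter(), Counter()
    for sent in sentences:
        for prev, cur in zip(sent, sent[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1

    def log_likelihood(text):
        # Out-of-vocabulary words (such as typos) are mapped to <unk>.
        toks = ["<s>"] + [t if t in vocab else "<unk>" for t in text.lower().split()]
        return sum(
            math.log(
                (bigram_counts[(prev, cur)] + alpha)
                / (context_counts[prev] + alpha * len(vocab))
            )
            for prev, cur in zip(toks, toks[1:])
        )

    return log_likelihood


def add_typos(text):
    """Hypothetical typo perturbation: swap the 2nd and 3rd letters of words with 4+ letters."""
    return " ".join(
        w[0] + w[2] + w[1] + w[3:] if len(w) >= 4 else w
        for w in text.split()
    )


def delta_score(story, perturb, log_likelihood):
    """Log-likelihood of the original story minus that of its perturbed version."""
    return log_likelihood(story) - log_likelihood(perturb(story))


corpus = [
    "the little girl walked to the old house",
    "the old house was dark and quiet",
    "the girl opened the door and walked inside",
]
log_likelihood = train_bigram_lm(corpus)

fluent = "the little girl walked to the old house"
disfluent = "teh lttle grl wakled too teh odl hous"  # already full of typos

delta_fluent = delta_score(fluent, add_typos, log_likelihood)
delta_disfluent = delta_score(disfluent, add_typos, log_likelihood)
print(f"fluent story:    {delta_fluent:.3f}")
print(f"disfluent story: {delta_disfluent:.3f}")
```

In this toy setting the fluent story loses likelihood when typos are introduced, whereas the already misspelled story is unaffected, which mirrors the hypothesis stated above. With a real language model, only `log_likelihood` would need to be replaced by a function that scores text under that model.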