Many evaluation metrics exist for natural language generation tasks, but they have limited utility for story generation: they generally correlate poorly with human judgments and are not designed to evaluate fine-grained story aspects such as fluency and relatedness. In this paper, we propose DeltaScore, an approach that uses perturbation to evaluate fine-grained story aspects. Our core idea rests on the hypothesis that the better a story performs in a specific aspect (e.g., fluency), the more it will be affected by a corresponding perturbation (e.g., introducing typos). To measure this impact, we calculate the likelihood difference between the pre- and post-perturbation stories using large pre-trained language models. We evaluate DeltaScore against state-of-the-art model-based and traditional similarity-based metrics across two story domains, and investigate its correlation with human judgments on five fine-grained story aspects: fluency, coherence, relatedness, logicality, and interestingness. Our findings indicate that DeltaScore performs strongly in evaluating fine-grained story aspects. Unexpectedly, we also find that a single perturbation method effectively captures a majority of these aspects.
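The likelihood-difference idea can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `add_typos`, `UnigramLM`, and `delta_score` are hypothetical helpers, the smoothed word-unigram model is only a toy stand-in for the large pre-trained language models the abstract refers to, and the sign convention (original minus perturbed) is one plausible reading of "likelihood difference", not a confirmed detail of the method.

```python
import math
import random
from collections import Counter
from typing import Callable


def add_typos(text: str, rate: float = 0.2, seed: int = 0) -> str:
    """Hypothetical perturbation: swap two adjacent inner letters in some words."""
    rng = random.Random(seed)
    words = text.split()
    for i, word in enumerate(words):
        # Only touch words long enough to keep first and last letters intact.
        if len(word) > 3 and rng.random() < rate:
            j = rng.randrange(1, len(word) - 2)
            words[i] = word[:j] + word[j + 1] + word[j] + word[j + 2:]
    return " ".join(words)


class UnigramLM:
    """Toy add-one-smoothed word unigram model (stand-in for a pre-trained LM)."""

    def __init__(self, corpus: str) -> None:
        self.counts = Counter(corpus.lower().split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen words

    def log_likelihood(self, text: str) -> float:
        # Sum of smoothed log-probabilities over whitespace-separated tokens.
        return sum(
            math.log((self.counts[w] + 1) / (self.total + self.vocab))
            for w in text.lower().split()
        )


def delta_score(
    story: str,
    perturb: Callable[[str], str],
    log_likelihood: Callable[[str], float],
) -> float:
    """Log-likelihood of the original story minus that of its perturbed version."""
    return log_likelihood(story) - log_likelihood(perturb(story))


if __name__ == "__main__":
    lm = UnigramLM("the cat sat on the mat and the dog sat on the rug")
    story = "the cat sat on the mat"
    # A larger value means the perturbation hurt the story's likelihood more.
    print(delta_score(story, lambda s: add_typos(s, rate=1.0), lm.log_likelihood))
```

Under the stated hypothesis, a story that is stronger in the targeted aspect should lose more likelihood when perturbed, so a larger difference would indicate a better story in that aspect. Passing the scorer in as a callable keeps the sketch independent of any particular model; in practice the toy unigram scorer would be replaced by a function that returns the log-likelihood of the text under a pre-trained language model.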