Many evaluation metrics have been developed for natural language generation, but they are of limited use for evaluating stories because they are not designed to assess fine-grained aspects of storytelling such as fluency and interestingness. In this paper, we introduce DELTASCORE, a method that uses perturbation to evaluate these aspects. Our central hypothesis is that the better a story is in a given aspect (e.g., fluency), the more it is affected by a perturbation that targets that aspect (e.g., the introduction of typos). We therefore measure the quality of an aspect as the difference in likelihood, computed with pre-trained language models, between the story before and after perturbation. We compare DELTASCORE with existing metrics on storytelling datasets from two domains across five fine-grained story aspects: fluency, coherence, relatedness, logicality, and interestingness. DELTASCORE demonstrates remarkable performance and, surprisingly, one specific perturbation proves highly effective at capturing multiple aspects.
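The scoring idea in the abstract (likelihood of the story before perturbation minus likelihood after) can be illustrated with a minimal sketch. This is not the authors' implementation: the add-one-smoothed character bigram model is only a toy stand-in for a pre-trained language model, and the names `train_char_bigram`, `add_typos`, and `delta_score` are hypothetical, chosen for this example. The typo perturbation here swaps two adjacent characters inside a word, which is one plausible reading of "the introduction of typos".

```python
import math
import random
from collections import Counter
from typing import Callable

Scorer = Callable[[str], float]


def train_char_bigram(corpus: str) -> Scorer:
    """Build a log-likelihood function from an add-one-smoothed character
    bigram model. ASSUMPTION: a toy stand-in for a pre-trained language model."""
    pair_counts = Counter(zip(corpus, corpus[1:]))
    context_counts = Counter(corpus[:-1])
    vocab_size = len(set(corpus)) + 1  # one extra slot for unseen characters

    def log_likelihood(text: str) -> float:
        return sum(
            math.log((pair_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size))
            for prev, cur in zip(text, text[1:])
        )

    return log_likelihood


def add_typos(text: str, rate: float = 0.3, seed: int = 0) -> str:
    """Hypothetical fluency perturbation: with probability `rate`, swap two
    adjacent characters in a word. Seeded so the result is reproducible."""
    rng = random.Random(seed)
    words = []
    for word in text.split(" "):
        if len(word) >= 2 and rng.random() < rate:
            i = rng.randrange(len(word) - 1)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        words.append(word)
    return " ".join(words)


def delta_score(story: str, log_likelihood: Scorer, perturb: Callable[[str], str]) -> float:
    """Log-likelihood of the original story minus that of its perturbed version."""
    return log_likelihood(story) - log_likelihood(perturb(story))


# Toy usage: the "language model" is trained on a single short sentence.
log_likelihood = train_char_bigram("the cat sat on the mat ")
story = "the cat sat"
score = delta_score(story, log_likelihood, lambda s: add_typos(s, rate=1.0))
print(f"fluency delta: {score:.3f}")  # positive here: typos make the story less likely
```

Because `delta_score` only needs a function from text to log-likelihood, replacing the toy bigram scorer with one backed by an actual pre-trained language model would leave the rest of the sketch unchanged.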