Generative AI can turn scientific articles into narratives for diverse audiences, but evaluating these stories remains challenging. Storytelling demands abstraction, simplification, and pedagogical creativity, qualities that standard summarization metrics rarely capture well. Meanwhile, factual hallucinations are critical in scientific contexts, yet detectors often misclassify legitimate narrative reformulations or prove unstable when creativity is involved. In this work, we propose StoryScore, a composite metric for evaluating AI-generated scientific stories. StoryScore integrates semantic alignment, lexical grounding, narrative control, structural fidelity, redundancy avoidance, and entity-level hallucination detection into a unified framework. Our analysis also reveals why many hallucination detection methods fail to distinguish pedagogical creativity from factual errors, highlighting a key limitation: while automatic metrics can effectively assess semantic similarity with the original content, they struggle to evaluate how that content is narrated and controlled.
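As a rough illustration of how a composite metric of this kind might aggregate its six dimensions, the sketch below combines per-dimension scores into a single weighted value. The component names, weights, and score values are illustrative assumptions, not the paper's actual formulation.

```python
# Hedged sketch of a composite score. Component names, weights, and the
# example values below are hypothetical; they do NOT reproduce StoryScore's
# actual definition.

def composite_score(components: dict, weights: dict = None) -> float:
    """Weighted average of per-dimension scores, each assumed in [0, 1]."""
    if weights is None:
        weights = {name: 1.0 for name in components}  # uniform by default
    total_weight = sum(weights[name] for name in components)
    return sum(weights[name] * components[name] for name in components) / total_weight

# Hypothetical per-dimension scores for one generated story.
scores = {
    "semantic_alignment": 0.82,
    "lexical_grounding": 0.74,
    "narrative_control": 0.69,
    "structural_fidelity": 0.78,
    "redundancy_avoidance": 0.91,
    "entity_hallucination": 0.88,  # higher = fewer hallucinated entities
}
print(round(composite_score(scores), 3))  # → 0.803
```

A uniform average is only the simplest choice; in practice the weights could be tuned against human judgments so that, for example, hallucination detection dominates the penalty in scientific contexts.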