Generative AI can turn scientific articles into narratives for diverse audiences, but evaluating these stories remains challenging. Storytelling demands abstraction, simplification, and pedagogical creativity, qualities that standard summarization metrics rarely capture well. Meanwhile, factual hallucinations are critical in scientific contexts, yet detectors often misclassify legitimate narrative reformulations or prove unstable when creativity is involved. In this work, we propose StoryScore, a composite metric for evaluating AI-generated scientific stories. StoryScore integrates semantic alignment, lexical grounding, narrative control, structural fidelity, redundancy avoidance, and entity-level hallucination detection into a unified framework. Our analysis also reveals why many hallucination detection methods fail to distinguish pedagogical creativity from factual errors, highlighting a key limitation: while automatic metrics can effectively assess semantic similarity with the original content, they struggle to evaluate how that content is narrated and controlled.
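As a rough illustration of how a composite metric of this kind might aggregate its six dimensions, the sketch below combines per-dimension scores into a single weighted value. The component names, weights, and score values are illustrative assumptions, not the paper's actual formulation.

```python
# Hedged sketch of a composite score. Component names, weights, and the
# example values below are hypothetical; they do NOT reproduce StoryScore's
# actual definition.

def composite_score(components: dict, weights: dict = None) -> float:
    """Weighted average of per-dimension scores, each assumed in [0, 1]."""
    if weights is None:
        weights = {name: 1.0 for name in components}  # uniform by default
    total_weight = sum(weights[name] for name in components)
    return sum(weights[name] * components[name] for name in components) / total_weight

# Hypothetical per-dimension scores for one generated story.
scores = {
    "semantic_alignment": 0.82,
    "lexical_grounding": 0.74,
    "narrative_control": 0.69,
    "structural_fidelity": 0.78,
    "redundancy_avoidance": 0.91,
    "entity_hallucination": 0.88,  # higher = fewer hallucinated entities
}
print(round(composite_score(scores), 3))  # → 0.803
```

A uniform average is only the simplest choice; in practice the weights could be tuned against human judgments so that, for example, hallucination detection dominates the penalty in scientific contexts.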