As AI-generated fiction becomes increasingly prevalent, questions of authorship and originality are becoming central to how written work is evaluated. While most existing work in this space focuses on identifying surface-level signatures of AI writing, we ask instead whether AI-generated stories can be distinguished from human ones without relying on stylistic signals, focusing on discourse-level narrative choices such as character agency and chronological discontinuity. We propose StoryScope, a pipeline that automatically induces a fine-grained, interpretable feature space of discourse-level narrative features across 10 dimensions. We apply StoryScope to a parallel corpus built from 10,272 writing prompts, each answered by a human author and by five LLMs, yielding 61,608 stories of roughly 5,000 words each and 304 extracted features per story. Narrative features alone achieve 93.2% macro-F1 for human vs. AI detection and 68.4% macro-F1 for six-way authorship attribution, retaining over 97% of the performance of models that include stylistic cues. A compact set of 30 core narrative features captures much of this signal: AI stories over-explain themes and favor tidy, single-track plots, while human stories frame protagonists' choices as more morally ambiguous and exhibit greater temporal complexity. Per-model fingerprint features enable six-way attribution: for example, Claude produces notably flat event escalation, GPT over-indexes on dream sequences, and Gemini defaults to external character description. We find that AI-generated stories cluster in a shared region of narrative space, while human-authored stories exhibit greater diversity. More broadly, these results suggest that differences in underlying narrative construction, not just writing style, can be used to separate human-written original works from AI-generated fiction.
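For readers unfamiliar with the metric reported above, the following is a minimal sketch of how macro-F1 is conventionally computed: per-class F1 scores are averaged with equal weight, so a rare author class counts as much as a frequent one. The function names `macro_f1` and `retention`, and the toy labels in the usage note, are illustrative assumptions and are not taken from StoryScope itself.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def retention(restricted_score, full_score):
    """Fraction of a full model's score kept by a restricted-feature model."""
    return restricted_score / full_score
```

As a toy usage example (made-up labels, not data from the paper), `macro_f1(["human", "human", "ai", "ai"], ["human", "ai", "ai", "ai"])` averages a per-class F1 of 2/3 for `human` and 4/5 for `ai`. The same function applies unchanged to the six-way attribution setting, since it makes no assumption about the number of classes.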