Long-form serialized audio drama, with arcs that run for 200 to 800 episodes, is a major creative medium and a setting in which frontier large language models (LLMs) fail. We benchmark 21 models, spanning classical, fine-tuned, open-frontier, closed-frontier, and reasoning tiers, on a uniform set of structural narrative metrics. All closed-frontier systems saturate at a plot-beat F1 in the band [0.78, 0.81] and lose about 0.20 F1 by horizon h = 200. We make three contributions. First, NarrativeWorldBench is an open benchmark of nine narrative-structure metrics evaluated at horizons h in {10, 20, 50, 100, 200}, with cross-lingual evaluation in four Indic languages (Hindi, Tamil, Telugu, Marathi). Second, N-VSSM is a Narrative Variational State-Space Model that maintains a structured 256-dimensional latent world state over more than 200 episodes, using a Mamba-2 backbone with an event-conditioned posterior and an 8B decoder. N-VSSM holds plot-beat F1 >= 0.84 at all horizons with 4x less compute than the closed-frontier systems. Third, a learned Cultural Transfer Function raises cross-language fidelity by 0.20 to 0.23 Likert points. In a within-subjects writer study (n = 12 professional authors, 240 trials), N-VSSM is preferred over Claude Opus 4.5 on long-arc consistency in 71% of trials and is rated 1.3 Likert points higher on controllability.