DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation

Drama script continuation requires models to maintain character consistency, advance the plot coherently, and preserve dramatic structure, capabilities that existing benchmarks fail to evaluate comprehensively. We present DramaBench, the first large-scale benchmark for evaluating drama script continuation across six independent dimensions: Format Standards, Narrative Efficiency, Character Consistency, Emotional Depth, Logic Consistency, and Conflict Handling. Our framework combines rule-based analysis with LLM-based labeling and statistical metrics to make evaluation objective and reproducible. We evaluate 8 state-of-the-art language models on 1,103 scripts (8,824 evaluations in total), with rigorous statistical significance testing (252 pairwise comparisons, 65.9% significant) and human validation (188 scripts, substantial agreement on 3 of 5 dimensions). Our ablation studies confirm that all six dimensions capture independent quality aspects (mean |r| = 0.020). DramaBench provides actionable, dimension-specific feedback for model improvement and establishes a rigorous standard for creative writing evaluation.
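To make the independence statistic concrete, the following is a minimal sketch of how a mean |r| over dimension pairs could be computed. It assumes, which the abstract does not state, that r is the Pearson correlation between per-script scores of two dimensions and that the mean is taken over all 15 unordered pairs of the six dimensions. The function names, the dictionary keys, and the toy scores are hypothetical and are not DramaBench data.

```python
from itertools import combinations
from math import sqrt
import random

# Hypothetical keys for the six dimensions named in the abstract.
DIMENSIONS = [
    "format_standards", "narrative_efficiency", "character_consistency",
    "emotional_depth", "logic_consistency", "conflict_handling",
]


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean_abs_correlation(scores):
    """Average |r| over every unordered pair of dimensions (assumed definition)."""
    pairs = list(combinations(sorted(scores), 2))
    return sum(abs(pearson(scores[a], scores[b])) for a, b in pairs) / len(pairs)


# Toy data only: 50 synthetic per-script scores for each dimension.
rng = random.Random(0)
toy_scores = {dim: [rng.random() for _ in range(50)] for dim in DIMENSIONS}

print(f"dimension pairs: {len(list(combinations(DIMENSIONS, 2)))}")
print(f"mean |r| on toy data: {mean_abs_correlation(toy_scores):.3f}")
```

Under this reading, a value near zero indicates that no dimension's scores are linearly predictable from another's, which is the sense in which the abstract calls the dimensions independent. The same bookkeeping accounts for the evaluation count: 8 models times 1,103 scripts gives 8,824 evaluations.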
