Drama script continuation requires models to maintain character consistency, advance plot coherently, and preserve dramatic structure. Existing benchmarks fail to evaluate these capabilities comprehensively. We present DramaBench, the first large-scale benchmark for evaluating drama script continuation across six independent dimensions: Format Standards, Narrative Efficiency, Character Consistency, Emotional Depth, Logic Consistency, and Conflict Handling. Our framework combines rule-based analysis with LLM-based labeling and statistical metrics, ensuring objective and reproducible evaluation. We conduct a comprehensive evaluation of 8 state-of-the-art language models on 1,103 scripts (8,824 evaluations total), with rigorous statistical significance testing (252 pairwise comparisons, 65.9% significant) and human validation (188 scripts, substantial agreement on 3/5 dimensions). Our ablation studies confirm that all six dimensions capture independent quality aspects (mean |r| = 0.020). DramaBench provides actionable, dimension-specific feedback for model improvement and establishes a rigorous standard for creative writing evaluation.