As large language models (LLMs) are increasingly applied to creative writing, their performance on culturally specific narrative tasks warrants systematic investigation. This study constructs the first Chinese film-script continuation benchmark, comprising 53 classic films, and designs a multi-dimensional evaluation framework to compare GPT-5.2 and Qwen-Max-Latest. Using a "first half to second half" continuation paradigm with three samples per film, we obtained 303 valid samples (GPT-5.2: 157, 98.7% validity; Qwen-Max: 146, 91.8% validity). The evaluation integrates ROUGE-L, structural similarity, and LLM-as-Judge scoring (DeepSeek-Reasoner). Statistical analysis of 144 paired samples shows that Qwen-Max achieves a marginally higher ROUGE-L score (0.2230 vs. 0.2114, d = -0.43), whereas GPT-5.2 significantly outperforms it in structural preservation (0.93 vs. 0.75, d = 0.46), overall quality (44.79 vs. 25.72, d = 1.04), and composite score (0.50 vs. 0.39, d = 0.84); the overall-quality difference exceeds the conventional threshold for a large effect (d > 0.8). GPT-5.2 excels in character consistency, tone and style matching, and format preservation, while Qwen-Max shows deficiencies in generation stability. This study provides a reproducible framework for evaluating LLMs on Chinese creative writing.