Evaluating creative text generation remains a challenge because existing reference-based metrics fail to capture the subjective nature of creativity. We propose a structured evaluation framework for AI story generation comprising four components (Novelty, Value, Adherence, and Resonance) and eleven sub-components. Using controlled story generation via ``Spike Prompting'' and a crowdsourced study of 115 readers, we examine how different creative components shape both immediate and reflective human creativity judgments. Our findings show that creativity is evaluated hierarchically rather than cumulatively, with different dimensions becoming salient at different stages of judgment, and that reflective evaluation substantially alters both ratings and inter-rater agreement. Together, these results support the effectiveness of our framework in revealing dimensions of creativity that are obscured by reference-based evaluation.