Human creativity has emerged as a critical competency in the era of large language models. Assessing creativity in complex, open-ended environments remains a grand challenge in data mining, hindered by a reliance on simple standardized tasks and by the scarcity of fine-grained expert data. Debate offers an ecologically valid assessment context: it reflects multiple dimensions of creativity, encompassing both divergent and convergent thinking, and it is data-rich, with a large volume of publicly accessible material. However, current mainstream automated scoring methods are poorly suited to complex settings such as debate, so assessment still relies on costly human evaluation. To address this gap, we propose DEFINED, a data-efficient computational framework for fine-grained creativity assessment in debate scenarios. DEFINED operationalizes debate creativity through a hierarchical eight-dimensional metric system, implemented as a pre-trained autoregressive language model with a hierarchical scoring head that supports both fine-grained and coarse-grained evaluation. We obtain statements and their associated expert scores from authentic debate competitions and apply a constrained data augmentation strategy to address the elite bias inherent in the original data. A mixed-granularity training strategy enables robust learning from limited fine-grained supervision annotated by trained graduate experts. To test ecological validity beyond synthetic benchmarks, we conduct an empirical study with debate-naive participants and use the resulting authentic data as a qualitative case study of mid-to-low proficiency populations. Across our evaluation protocol, our scoring model produces accurate and stable scores, outperforming prompt-based large language model evaluators and existing debate scoring methods.