Human creativity has emerged as a critical competency in the era of large language models. Assessing creativity in complex, open-ended environments remains a grand challenge in data mining, hindered by reliance on simple standardized tasks and by the scarcity of fine-grained expert data. Debate offers an ecologically valid assessment context: it engages multiple dimensions of creativity, spanning both divergent and convergent thinking, and it is data-rich, with a large volume of publicly accessible material. However, mainstream automated scoring methods are poorly suited to complex settings such as debate, so assessment still relies on costly human evaluation. To address this gap, we propose DEFINED, a data-efficient computational framework for fine-grained creativity assessment in debate scenarios. DEFINED operationalizes debate creativity through a hierarchical eight-dimensional metric system, implemented as a pre-trained autoregressive language model with a hierarchical scoring head that supports both fine-grained and coarse-grained evaluation. We collect statements and their associated expert scores from authentic debate competitions and apply a constrained data augmentation strategy to address the elite bias inherent in the original data. A mixed-granularity training strategy enables robust learning from the limited fine-grained supervision annotated by trained graduate experts. To test ecological validity beyond synthetic benchmarks, we also conduct an empirical study with debate-naive participants, whose authentic data serve as a qualitative case study of mid-to-low proficiency populations. Across our evaluation protocol, the scoring model achieves accurate and stable scoring, outperforming prompt-based large language model evaluators and existing debate scoring methods.
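As a rough illustration of how mixed-granularity supervision might be combined, the sketch below computes a loss that always uses coarse-grained targets and adds a fine-grained term only when fine labels exist. Everything in it is an assumption made for illustration, not the DEFINED implementation: the grouping of the eight fine dimensions into two coarse ones (`GROUPS`, labeled here as divergent and convergent), the use of a group mean to derive coarse scores, the squared-error loss, and the names `coarse_from_fine`, `mixed_granularity_loss`, and `fine_weight`.

```python
from typing import Optional, Sequence

# Hypothetical hierarchy: 8 fine dimensions grouped into 2 coarse ones.
# The abstract states only that the metric system is hierarchical and
# eight-dimensional; this particular grouping is an assumption.
GROUPS = {
    "divergent": [0, 1, 2, 3],
    "convergent": [4, 5, 6, 7],
}


def coarse_from_fine(fine: Sequence[float]) -> list:
    """Derive coarse scores as the mean of each group's fine scores (assumed rule)."""
    return [sum(fine[i] for i in idx) / len(idx) for idx in GROUPS.values()]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length score vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mixed_granularity_loss(
    pred_fine: Sequence[float],
    coarse_target: Sequence[float],
    fine_target: Optional[Sequence[float]] = None,
    fine_weight: float = 1.0,
) -> float:
    """Coarse loss for every sample; fine loss added only when fine labels exist."""
    loss = mse(coarse_from_fine(pred_fine), coarse_target)
    if fine_target is not None:
        loss += fine_weight * mse(pred_fine, fine_target)
    return loss


# A sample with only coarse (competition-level) scores.
coarse_only = mixed_granularity_loss([2.0] * 4 + [4.0] * 4, [3.0, 3.0])

# A sample that also carries fine-grained expert annotations.
with_fine = mixed_granularity_loss(
    [2.0] * 4 + [4.0] * 4, [3.0, 3.0], fine_target=[2.0] * 4 + [4.0] * 4
)
```

Under these assumptions, samples with only coarse scores still contribute a training signal, while the scarcer fine-grained annotations constrain the individual dimensions wherever they are available.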