Automatic systems are increasingly used to assess the originality of responses in creative tasks. They offer a potential solution to key limitations of human assessment (cost, fatigue, and subjectivity), but there is preliminary evidence of a self-preference bias: automatic systems tend to prefer outputs that are closer to their own style than to a human one. In this paper, we investigated how closely Large Language Models (LLMs) align with human raters in assessing the originality of responses in a divergent thinking task. We analysed 4,813 responses to the Alternate Uses Task (AUT) produced by more and less creative humans and by ChatGPT-4o. The human raters were two university students who underwent intensive training. The machine raters were two specialised systems fine-tuned on AUT responses and the corresponding human ratings (OCSAI and CLAUS), plus ChatGPT-4o, which was prompted with the same instructions as the human raters. Results confirmed the presence of a self-preference bias in LLMs: automatic systems tended to favour AI-generated responses. However, this bias disappeared when the analyses controlled for idea elaboration. We discuss the theoretical and methodological implications of these findings and highlight future directions for research on creativity assessment.