Researchers have argued that large language models (LLMs) exhibit high-quality writing capabilities in forms ranging from blogs to stories. However, objectively evaluating the creativity of a piece of writing is challenging. Inspired by the Torrance Test of Creative Thinking (TTCT), which measures creativity as a process, we use the Consensual Assessment Technique [3] and propose the Torrance Test of Creative Writing (TTCW) to evaluate creativity as a product. TTCW consists of 14 binary tests organized into the original dimensions of Fluency, Flexibility, Originality, and Elaboration. We recruit 10 creative writers and conduct a human assessment of 48 stories, written either by professional authors or by LLMs, using TTCW. Our analysis shows that LLM-generated stories pass 3-10X fewer TTCW tests than stories written by professionals. In addition, we explore the use of LLMs as assessors to automate the TTCW evaluation and find that none of the LLMs positively correlate with the expert assessments.