Artificial intelligence (AI) systems, and Large Language Models (LLMs) in particular, are increasingly employed for creative tasks such as scientific idea generation, constituting a form of generalization from training data that existing conceptual frameworks do not address. Despite its similarities to compositional generalization (CG), combinatorial creativity (CC) is an open-ended ability. Rather than evaluating against fixed targets for accuracy or correctness, which would contradict the open-ended nature of CC, we propose a theoretical framework and algorithmic task for evaluating outputs by their degrees of novelty and utility. Building on this framework, we make three empirical contributions: (1) We obtain the first insights into the scaling behavior of creativity for LLMs. (2) We discover that, for fixed compute budgets, there exist optimal model depths and widths for creative ability. (3) We find that the ideation-execution gap, whereby LLMs excel at generating novel scientific ideas but struggle to ensure their practical feasibility, may be explained by a more fundamental novelty-utility tradeoff characteristic of creativity algorithms in general. Importantly, this tradeoff persists even at scale, casting doubt on the long-term creative potential of LLMs in their current form. Together, our conceptual framework and empirical findings provide a foundation for understanding and improving creativity in modern AI models, bridging the gap between human and machine intelligence.