Artificial intelligence (AI) systems, and Large Language Models (LLMs) in particular, are increasingly employed for creative tasks such as scientific idea generation, a form of generalization from training data that existing conceptual frameworks do not address. Despite its similarities to compositional generalization (CG), combinatorial creativity (CC) is an open-ended ability. Rather than evaluating for accuracy or correctness against fixed targets, which would contradict the open-ended nature of CC, we propose a theoretical framework and algorithmic task that evaluate outputs by their degrees of novelty and utility. Building on this framework, we make several empirical contributions: (1) we obtain the first insights into the scaling behavior of creativity in LLMs; (2) we find that, for a fixed compute budget, there exist optimal model depths and widths for creative ability; and (3) we show that the ideation-execution gap, whereby LLMs excel at generating novel scientific ideas but struggle to ensure their practical feasibility, may be explained by a more fundamental novelty-utility tradeoff characteristic of creativity algorithms in general. Although our findings persist up to the 100M-parameter scale, frontier models today are well into the billions of parameters. Our conceptual framework and empirical findings can therefore best serve as a starting point for understanding and improving the creativity of frontier-scale models, as we begin to bridge the gap between human and machine intelligence.
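To make the evaluation idea concrete, the sketch below scores candidate outputs along two axes instead of against a fixed target. This is a hypothetical illustration, not the paper's actual metric: here novelty is approximated as dissimilarity to a set of known examples, and utility as similarity to a stated goal, using simple string similarity as a stand-in for a learned scorer.

```python
# Hypothetical sketch of novelty/utility scoring (not the paper's metric).
# String similarity stands in for whatever scorer the framework would use.
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Cheap proxy for semantic similarity, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def novelty(candidate: str, known: list[str]) -> float:
    """Novel = dissimilar to everything already known."""
    return 1.0 - max(similarity(candidate, k) for k in known)


def utility(candidate: str, goal: str) -> float:
    """Useful = close to the stated goal (a stand-in for feasibility)."""
    return similarity(candidate, goal)


known = ["solar panels on rooftops", "offshore wind turbines"]
goal = "low-cost renewable energy storage"

candidates = [
    "solar panels on rooftops",              # known verbatim: zero novelty
    "gravity-based energy storage in mines", # novel and goal-relevant
    "a poem about passing clouds",           # novel but low utility
]
for c in candidates:
    print(f"{c!r}: novelty={novelty(c, known):.2f}, utility={utility(c, goal):.2f}")
```

A candidate identical to a known example scores zero novelty, while an unrelated string scores high novelty but low utility; the tension between the two axes is the tradeoff the abstract refers to.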