While Large Language Models (LLMs) have demonstrated remarkable capabilities in scientific tasks, existing evaluation frameworks primarily assess their performance using rich contextual inputs, overlooking their ability to generate novel ideas from minimal information. We introduce LiveIdeaBench, a comprehensive benchmark that evaluates LLMs' scientific creativity and divergent thinking capabilities using single-keyword prompts. Drawing on Guilford's creativity theory, our framework employs a dynamic panel of state-of-the-art LLMs to assess generated ideas along four key dimensions: originality, feasibility, fluency, and flexibility. Through extensive experiments with 20 leading models across 1,180 keywords spanning 18 scientific domains, we find that scientific creativity exhibits patterns distinct from general intelligence metrics. Notably, our results show that models such as QwQ-32B-preview achieve creative performance comparable to top-tier models like o1-preview, despite significant gaps in their general intelligence scores. These findings highlight the importance of specialized evaluation frameworks for scientific creativity and suggest that the development of creative capabilities in LLMs may follow trajectories different from those of traditional problem-solving abilities.
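The panel-based evaluation described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual implementation: it assumes each judge model returns a numeric score per dimension and that panel scores are aggregated by simple averaging (the function name, rating scale, and aggregation rule are all assumptions for illustration).

```python
from statistics import mean

# The four Guilford-inspired dimensions used by LiveIdeaBench.
DIMENSIONS = ("originality", "feasibility", "fluency", "flexibility")

def aggregate_scores(judge_ratings):
    """Average each dimension's score across a panel of judge models.

    judge_ratings: list of dicts, one per judge,
                   mapping dimension name -> numeric score.
    (Hypothetical aggregation; the benchmark's exact scheme may differ.)
    """
    return {dim: mean(r[dim] for r in judge_ratings) for dim in DIMENSIONS}

# Illustrative example: three judges rating one generated idea on a 0-10 scale.
ratings = [
    {"originality": 8, "feasibility": 6, "fluency": 7, "flexibility": 7},
    {"originality": 7, "feasibility": 7, "fluency": 8, "flexibility": 6},
    {"originality": 9, "feasibility": 5, "fluency": 7, "flexibility": 8},
]
print(aggregate_scores(ratings))
```

Averaging over a rotating panel of judges, rather than relying on a single fixed grader, is one plausible way to reduce the bias any one model's preferences would introduce into creativity scores.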