While Large Language Models (LLMs) demonstrate remarkable capabilities in scientific tasks such as literature analysis and experimental design (e.g., accurately extracting key findings from papers or generating coherent experimental procedures), existing evaluation benchmarks primarily assess performance using rich contextual inputs. We introduce LiveIdeaBench, a comprehensive benchmark that evaluates LLMs' scientific idea generation by assessing divergent thinking with single-keyword prompts. Drawing on Guilford's creativity theory, our benchmark employs a dynamic panel of state-of-the-art LLMs to score generated ideas across five key dimensions: originality, feasibility, fluency, flexibility, and clarity. Through extensive experiments with over 40 leading models across 1,180 keywords spanning 22 scientific domains, we find that the scientific idea generation capabilities measured by our benchmark are poorly predicted by standard metrics of general intelligence. Our results show that models such as QwQ-32B-preview achieve creative performance comparable to that of top-tier models such as claude-3.7-sonnet:thinking, despite significant gaps in their general intelligence scores. These findings highlight the need for specialized evaluation benchmarks for scientific idea generation and suggest that enhancing these capabilities in LLMs may require training strategies different from those used to improve general problem-solving, potentially enabling a broader range of AI tools tailored to different stages of the scientific process.