While Large Language Models (LLMs) have demonstrated remarkable capabilities in scientific tasks, existing evaluation frameworks primarily assess their performance using rich contextual inputs, overlooking their ability to generate novel ideas from minimal information. We introduce LiveIdeaBench, a comprehensive benchmark that evaluates LLMs' scientific creativity and divergent thinking capabilities using single-keyword prompts. Drawing on Guilford's creativity theory, our framework employs a dynamic panel of state-of-the-art LLMs to assess generated ideas along four key dimensions: originality, feasibility, fluency, and flexibility. Through extensive experiments with 20 leading models across 1,180 keywords spanning 18 scientific domains, we find that scientific creativity exhibits patterns distinct from general intelligence metrics. Notably, our results show that models such as QwQ-32B-preview achieve creative performance comparable to top-tier models like o1-preview, despite significant gaps in their general intelligence scores. These findings highlight the importance of specialized evaluation frameworks for scientific creativity and suggest that the development of creative capabilities in LLMs may follow trajectories different from those of traditional problem-solving abilities.
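The panel-based evaluation described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual implementation: it assumes each judge model returns a numeric score per dimension and that panel scores are aggregated by simple averaging (the function name, rating scale, and aggregation rule are all assumptions for illustration).

```python
from statistics import mean

# The four Guilford-inspired dimensions used by LiveIdeaBench.
DIMENSIONS = ("originality", "feasibility", "fluency", "flexibility")

def aggregate_scores(judge_ratings):
    """Average each dimension's score across a panel of judge models.

    judge_ratings: list of dicts, one per judge,
                   mapping dimension name -> numeric score.
    (Hypothetical aggregation; the benchmark's exact scheme may differ.)
    """
    return {dim: mean(r[dim] for r in judge_ratings) for dim in DIMENSIONS}

# Illustrative example: three judges rating one generated idea on a 0-10 scale.
ratings = [
    {"originality": 8, "feasibility": 6, "fluency": 7, "flexibility": 7},
    {"originality": 7, "feasibility": 7, "fluency": 8, "flexibility": 6},
    {"originality": 9, "feasibility": 5, "fluency": 7, "flexibility": 8},
]
print(aggregate_scores(ratings))
```

Averaging over a rotating panel of judges, rather than relying on a single fixed grader, is one plausible way to reduce the bias any one model's preferences would introduce into creativity scores.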