The rapid evolution of Large Language Models (LLMs) has catalyzed a surge in scientific idea production, yet this leap has not been matched by a corresponding advance in idea evaluation. Scientific evaluation fundamentally requires knowledgeable grounding, collective deliberation, and multi-criteria decision-making. Existing idea evaluation methods, however, often suffer from narrow knowledge horizons, flattened evaluation dimensions, and the inherent biases of LLM-as-a-Judge. To address these issues, we frame idea evaluation as a knowledge-grounded, multi-perspective reasoning problem and introduce InnoEval, a deep innovation evaluation framework designed to emulate human-level idea assessment. We employ a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We further achieve review consensus through an innovation review board composed of reviewers with distinct academic backgrounds, enabling a multi-dimensional, decoupled evaluation across multiple metrics. To benchmark InnoEval, we construct comprehensive datasets derived from authoritative peer-reviewed submissions. Experiments demonstrate that InnoEval consistently outperforms baselines on point-wise, pair-wise, and group-wise evaluation tasks, exhibiting judgment patterns and consensus highly aligned with those of human experts.