Scientific discovery begins with ideas, yet evaluating early-stage research concepts requires subtle and subjective human judgment. As large language models (LLMs) are increasingly tasked with generating scientific hypotheses, most systems implicitly assume that scientists' evaluations form a fixed gold standard that does not change over time. Here we challenge this assumption. In a two-wave study with 7,938 ratings from 63 active researchers across six scientific departments, each participant repeatedly evaluated a constant "control" research idea alongside AI-generated ideas. We find that expert evaluations are not stable: test-retest reliability of overall quality is only moderate (ICC ~0.59-0.74), indicating substantial within-participant variability even for identical ideas. Yet the internal structure of judgment, such as the relative importance placed on originality, feasibility, clarity, and other criteria, remained stable. We then aligned an LLM-based ideation system to first-wave human ratings and used it to select new ideas. Although alignment improved agreement with Wave-1 evaluations, its apparent gains disappeared once drift in human standards was accounted for. Thus, tuning to a fixed human snapshot produced improvements that were transient rather than persistent. These findings reveal that human evaluation of scientific ideas is not static but dynamic: its priorities are stable, while its calibration shifts over time. Treating one-time human ratings as immutable ground truth risks overstating progress in AI-assisted ideation and obscuring the challenge of co-evolving with changing expert standards. Drift-aware evaluation protocols and longitudinal benchmarks may therefore be essential for building AI systems that reliably augment, rather than overfit to, human scientific judgment.
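To make the test-retest idea concrete, the following is a minimal sketch of how an intraclass correlation could be computed from two waves of ratings. The abstract does not state which ICC form the study used, so the two-way single-measure formulas here (a consistency form and an absolute-agreement form) are an assumption, and the function name `icc_two_way` and the toy data are hypothetical.

```python
def icc_two_way(ratings):
    """Two-way single-measure ICCs for a table of ratings.

    ratings: list of rows, one row per rated target, one column per wave.
    Returns (consistency, absolute_agreement).
    Assumption: these ICC forms are illustrative, not necessarily the study's.
    """
    n = len(ratings)       # number of rated targets (rows)
    k = len(ratings[0])    # number of waves (columns)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    # Consistency ignores a uniform shift between waves.
    consistency = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    # Absolute agreement penalizes a shift in the overall rating level.
    agreement = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
    return consistency, agreement


# Hypothetical toy data: every Wave-2 rating is one point higher than Wave 1.
wave1 = [1, 2, 3, 4, 5]
wave2 = [x + 1 for x in wave1]
shifted = [[a, b] for a, b in zip(wave1, wave2)]
consistency, agreement = icc_two_way(shifted)
print(consistency, agreement)
```

In this toy case the rank order of ratings is perfectly preserved, so the consistency form equals 1, while the absolute-agreement form drops to about 0.83 because the overall rating level moved between waves. That gap is one simple way to picture stable priorities combined with shifting calibration.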