Large language models are increasingly used to assess and forecast research ideas, yet we lack scalable ways to evaluate the quality of their judgments about scientific ideas. To this end, we introduce PoT, a semi-verifiable benchmarking framework that links scientific idea judgments to downstream signals that become observable later (e.g., citations and shifts in researchers' agendas). PoT freezes a pre-cutoff snapshot of evidence in an offline sandbox and asks models to forecast post-cutoff outcomes, enabling verifiable evaluation once ground truth arrives, scalable benchmarking without exhaustive expert annotation, and analysis of human-model misalignment against signals such as peer-review awards. In addition, PoT provides a controlled testbed for agent-based judgment of scientific ideas, comparing tool-using agents with non-agent baselines under prompt ablations and budget scaling. Across 30,000+ instances spanning four benchmark domains, we find that higher interaction budgets generally improve agent performance over non-agent baselines, while the benefit of tool use is strongly task-dependent. By combining time-partitioned, future-verifiable targets with an offline sandbox for tool use, PoT supports scalable evaluation of agents on future-facing scientific idea judgment tasks.
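For concreteness, the following is a minimal sketch of the time-partitioned, future-verifiable evaluation loop described above. All names (EvidenceItem, Instance, forecast_fn) and the mean-absolute-error metric are hypothetical illustrations, not PoT's actual implementation or API.

```python
# Minimal sketch of a time-partitioned, future-verifiable evaluation loop.
# All names here are hypothetical and do not correspond to PoT's real API.
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable


@dataclass
class EvidenceItem:
    published: date   # when the evidence became publicly available
    text: str         # e.g., an abstract or a citation record


@dataclass
class Instance:
    idea: str                      # the scientific idea to be judged
    evidence: list[EvidenceItem]   # all evidence, pre- and post-cutoff
    outcome: float | None          # post-cutoff signal (e.g., citations); None until observed


def pre_cutoff_snapshot(items: Iterable[EvidenceItem], cutoff: date) -> list[EvidenceItem]:
    """Freeze the offline sandbox: only evidence published before the cutoff is visible."""
    return [it for it in items if it.published < cutoff]


def evaluate(instances: list[Instance],
             forecast_fn: Callable[[str, list[EvidenceItem]], float],
             cutoff: date) -> float:
    """Score forecasts on instances whose post-cutoff ground truth has arrived.

    Uses mean absolute error as a placeholder metric.
    """
    errors = []
    for inst in instances:
        snapshot = pre_cutoff_snapshot(inst.evidence, cutoff)
        prediction = forecast_fn(inst.idea, snapshot)   # model sees only pre-cutoff evidence
        if inst.outcome is not None:                    # verifiable once the outcome is observed
            errors.append(abs(prediction - inst.outcome))
    return sum(errors) / len(errors) if errors else float("nan")
```

Under this framing, instances whose outcomes have not yet materialized are simply skipped and become scorable later, which is what makes the benchmark semi-verifiable rather than requiring exhaustive expert annotation up front.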