We introduce FrontierScience, a benchmark evaluating expert-level scientific reasoning in frontier language models. Recent model progress has nearly saturated existing science benchmarks, which often rely on multiple-choice knowledge questions or already-published information. FrontierScience addresses this gap through two complementary tracks: (1) Olympiad, consisting of international olympiad problems at the level of IPhO, IChO, and IBO, and (2) Research, consisting of PhD-level, open-ended problems representative of sub-tasks in scientific research. FrontierScience contains several hundred questions (including 160 in the open-sourced gold set) covering subfields across physics, chemistry, and biology, from quantum electrodynamics to synthetic organic chemistry. All Olympiad problems are original problems written by international Olympiad medalists and national-team coaches to ensure standards of difficulty, originality, and factuality. All Research problems are research sub-tasks written and verified by PhD scientists (doctoral candidates, postdoctoral researchers, or professors). For Research, we introduce a granular rubric-based evaluation framework that assesses model capabilities throughout the process of solving a research task, rather than judging only a standalone final answer.
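The paper's actual rubrics are not reproduced here, but the general shape of granular rubric-based evaluation can be sketched as follows: each problem carries weighted criteria covering intermediate reasoning steps, a grader awards partial credit per criterion, and the score is a weighted average. The `Criterion` class, the example criteria, and the weights below are all hypothetical illustrations, not the benchmark's own rubric schema.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str
    weight: float  # relative importance of this rubric item (hypothetical)
    earned: float  # fraction of credit awarded by the grader, in [0, 1]

def rubric_score(criteria: list[Criterion]) -> float:
    """Weighted average of per-criterion credit, normalized to [0, 1]."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight == 0:
        return 0.0
    return sum(c.weight * c.earned for c in criteria) / total_weight

# Illustrative rubric for a physics research sub-task: intermediate
# reasoning steps earn credit even if the final answer is wrong.
criteria = [
    Criterion("Identifies the governing equation", weight=2.0, earned=1.0),
    Criterion("Applies correct boundary conditions", weight=1.0, earned=0.5),
    Criterion("Arrives at the correct final expression", weight=1.0, earned=0.0),
]
print(rubric_score(criteria))  # → 0.625
```

In this sketch a model that sets up the problem correctly but fails the final step still scores 0.625, illustrating how process-level rubrics differ from judging only a standalone final answer.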