As large language models (LLMs) transition from general knowledge retrieval to complex scientific discovery, their evaluation standards must also incorporate the rigorous norms of scientific inquiry. Existing benchmarks exhibit a critical blind spot: general instruction-following metrics focus on superficial formatting, while domain-specific scientific benchmarks assess only final-answer correctness, often rewarding models that arrive at the right result for the wrong reasons. To address this gap, we introduce scientific instruction following: the capability to solve problems while strictly adhering to the constraints that establish scientific validity. Specifically, we present SciIF, a multi-discipline benchmark that evaluates this capability by pairing university-level problems with a fixed catalog of constraints across three pillars: scientific conditions (e.g., boundary checks and assumptions), semantic stability (e.g., unit and symbol conventions), and specific processes (e.g., required numerical methods). Uniquely, SciIF emphasizes auditability, requiring models to provide explicit evidence of constraint satisfaction rather than implicit compliance. By measuring both solution correctness and multi-constraint adherence, SciIF enables fine-grained diagnosis of compositional reasoning failures, ensuring that LLMs can function as reliable agents within the strict logical frameworks of science.
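To make the described setup concrete, the following is a minimal, hypothetical sketch of how a benchmark item and its two-part score (final-answer correctness plus per-constraint adherence) might be represented. The field names, pillar labels, and the exact-match scoring rule are illustrative assumptions for exposition only, not SciIF's actual schema or metric.

```python
# Hypothetical sketch of a SciIF-style item and its scoring.
# All names, constraint labels, and the scoring rule are assumptions, not the benchmark's real schema.
from dataclasses import dataclass, field


@dataclass
class Constraint:
    pillar: str            # e.g. "scientific_condition" | "semantic_stability" | "specific_process"
    description: str       # e.g. "state the small-angle assumption explicitly"
    satisfied: bool = False  # set by a verifier that checks for explicit evidence in the response


@dataclass
class SciIFItem:
    problem: str                       # university-level problem statement
    reference_answer: str              # gold final answer
    constraints: list[Constraint] = field(default_factory=list)


def score(item: SciIFItem, model_answer: str) -> dict:
    """Combine final-answer correctness with multi-constraint adherence."""
    correct = model_answer.strip() == item.reference_answer.strip()
    adherence = (
        sum(c.satisfied for c in item.constraints) / len(item.constraints)
        if item.constraints
        else 1.0
    )
    return {"correct": correct, "constraint_adherence": adherence}
```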