Machine learning (ML) is transforming modeling and control in the physical, engineering, and biological sciences. However, rapid development has outpaced the creation of standardized, objective benchmarks, leading to weak baselines, reporting bias, and inconsistent evaluations across methods. This undermines reproducibility, misguides resource allocation, and obscures scientific progress. To address this, we propose a Common Task Framework (CTF) for scientific machine learning. The CTF features a curated set of datasets and task-specific metrics spanning forecasting, state reconstruction, and generalization under realistic constraints, including noise and limited data. Inspired by the success of CTFs in fields like natural language processing and computer vision, our framework provides a structured, rigorous foundation for head-to-head evaluation of diverse algorithms. As a first step, we benchmark methods on two canonical nonlinear systems: Kuramoto-Sivashinsky and Lorenz. These results illustrate the utility of the CTF in revealing method strengths, limitations, and suitability for specific classes of problems and diverse objectives. Next, we are launching a competition around a global, real-world sea surface temperature dataset with a true holdout set to foster community engagement. Our long-term vision is to replace ad hoc comparisons with standardized evaluations on hidden test sets that raise the bar for rigor and reproducibility in scientific ML.
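To make the forecasting task concrete, the sketch below shows, in minimal form, what a benchmark evaluation on the Lorenz system might look like: a reference trajectory is split into a visible training segment and a hidden horizon, and a candidate forecast is scored against the hidden segment. The normalized RMSE metric and the persistence baseline used here are illustrative assumptions, not the CTF's actual specification.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Lorenz-63 system with the standard chaotic parameters.
def lorenz(t, xyz, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = xyz
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

# Generate a reference trajectory; the first part serves as visible
# training data, the remainder as a hidden horizon used only for scoring.
t_eval = np.arange(0.0, 50.0, 0.01)
sol = solve_ivp(lorenz, (t_eval[0], t_eval[-1]), [1.0, 1.0, 1.0],
                t_eval=t_eval, rtol=1e-9, atol=1e-9)
truth = sol.y.T                      # shape (n_steps, 3)
split = int(0.8 * len(truth))
train, hidden = truth[:split], truth[split:]

def normalized_rmse(forecast, reference):
    """Hypothetical forecasting score: RMSE scaled by the reference std."""
    err = np.sqrt(np.mean((forecast - reference) ** 2))
    return err / np.std(reference)

# A trivial persistence baseline: repeat the last observed state.
baseline = np.tile(train[-1], (len(hidden), 1))
print("persistence baseline nRMSE:", normalized_rmse(baseline, hidden))
```

In a CTF-style evaluation, the hidden segment would never be released to participants; submitted forecasts would be scored server-side against it, which is what distinguishes a true holdout from the ad hoc train/test splits common in current practice.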