AI for Science (AI4S) is an emerging research field that aims to improve the accuracy and speed of scientific computing tasks using machine learning methods. Traditional AI benchmarking methods adapt poorly to AI4S because they assume that training data, test data, and future real-world queries are independent and identically distributed, whereas AI4S workloads are expected to encounter out-of-distribution problem instances. This paper investigates the need for a new approach to benchmarking AI for science, using the machine learning force field (MLFF) as a case study. MLFF is a method for accelerating molecular dynamics (MD) simulation at low computational cost while retaining high accuracy. We identify several missed opportunities for scientifically meaningful benchmarking and propose solutions for evaluating MLFF models in terms of sample efficiency, time-domain sensitivity, and cross-dataset generalization. Setting up problem instances that resemble actual scientific applications yields more meaningful performance metrics from the benchmark. Compared with traditional AI benchmarking methodologies, this suite of metrics demonstrates a better ability to assess a model's performance in real-world scientific applications. This work is a component of the SAIBench project, an AI4S benchmarking suite. The project homepage is https://www.computercouncil.org/SAIBench.