Scientific understanding is a fundamental goal of science, allowing us to explain the world. There is currently no good way to measure the scientific understanding of agents, whether humans or Artificial Intelligence systems. Without a clear benchmark, it is difficult to evaluate and compare different levels of scientific understanding and different approaches to it. In this Roadmap, we propose a framework for creating a benchmark for scientific understanding, using tools from the philosophy of science. We adopt a behavioral notion according to which genuine understanding should be recognized as an ability to perform certain tasks. We extend this notion by considering a set of questions that gauge different levels of scientific understanding, covering information retrieval, the capability to arrange information to produce an explanation, and the ability to infer how things would be different under different circumstances. The Scientific Understanding Benchmark (SUB), which is formed by a set of these tests, allows for the evaluation and comparison of different approaches. Benchmarking plays a crucial role in establishing trust, ensuring quality control, and providing a basis for performance evaluation. By aligning machine and human scientific understanding, we can improve their utility, ultimately advancing scientific understanding and helping to discover new insights within machines.