We introduce MLFMF, a collection of data sets for benchmarking recommendation systems used to support formalization of mathematics with proof assistants. These systems help humans identify which previous entries (theorems, constructions, datatypes, and postulates) are relevant in proving a new theorem or carrying out a new construction. Each data set is derived from a library of formalized mathematics written in proof assistants Agda or Lean. The collection includes the largest Lean~4 library Mathlib, and some of the largest Agda libraries: the standard library, the library of univalent mathematics Agda-unimath, and the TypeTopology library. Each data set represents the corresponding library in two ways: as a heterogeneous network, and as a list of s-expressions representing the syntax trees of all the entries in the library. The network contains the (modular) structure of the library and the references between entries, while the s-expressions give complete and easily parsed information about every entry. We report baseline results using standard graph and word embeddings, tree ensembles, and instance-based learning algorithms. The MLFMF data sets provide solid benchmarking support for further investigation of the numerous machine learning approaches to formalized mathematics. The methodology used to extract the networks and the s-expressions readily applies to other libraries, and is applicable to other proof assistants. With more than $250\,000$ entries in total, this is currently the largest collection of formalized mathematical knowledge in machine learnable format.
翻译:我们引入MLFMF,这是一组用于评测支撑基于证明助手的数学形式化推荐系统的数据集。这些系统帮助人类识别哪些先前的条目(定理、构造、数据类型和公设)与证明新定理或执行新构造相关。每个数据集均源自使用证明助手Agda或Lean编写的形式化数学库。该集合包含最大的Lean~4库Mathlib,以及一些最大的Agda库:标准库、单值数学库Agda-unimath和TypeTopology库。每个数据集以两种方式表示对应库:作为异质网络,以及作为表示库中所有条目语法树的S表达式列表。网络包含库的(模块化)结构及条目间的引用关系,而S表达式则提供关于每个条目的完整且易于解析的信息。我们使用标准图嵌入、词嵌入、树集成和基于实例的学习算法报告了基线结果。MLFMF数据集为后续研究众多面向形式化数学的机器学习方法提供了坚实的评测支持。用于提取网络和S表达式的方法可直接应用于其他库,并适用于其他证明助手。该数据集包含总计超过25万个条目,是目前以机器学习可读格式存储的最大形式化数学知识集合。