Enhancing Formal Theorem Proving: A Comprehensive Dataset for Training AI Models on Coq Code

In the realm of formal theorem proving, the Coq proof assistant stands out for its rigorous approach to verifying mathematical assertions and software correctness. Despite the advances in artificial intelligence and machine learning, the specialized nature of Coq syntax and semantics poses unique challenges for Large Language Models (LLMs). Addressing this gap, we present a comprehensive dataset specifically designed to enhance LLMs' proficiency in interpreting and generating Coq code. This dataset, derived from a collection of over 10,000 Coq source files, encompasses a wide array of propositions, proofs, and definitions, enriched with metadata including source references and licensing information. Our primary aim is to facilitate the development of LLMs capable of generating syntactically correct and semantically meaningful Coq constructs, thereby advancing the frontier of automated theorem proving. Initial experiments with this dataset have showcased its significant potential; models trained on this data exhibited enhanced accuracy in Coq code generation. Notably, a particular experiment revealed that a fine-tuned LLM was capable of generating 141 valid proofs for a basic lemma, highlighting the dataset's utility in facilitating the discovery of diverse and valid proof strategies. This paper discusses the dataset's composition, the methodology behind its creation, and the implications of our findings for the future of machine learning in formal verification. The dataset is accessible for further research and exploration: https://huggingface.co/datasets/florath/coq-facts-props-proofs-gen0-v1

翻译：在形式化定理证明领域，Coq证明助手因其对数学断言和软件正确性的严谨验证方法而独树一帜。尽管人工智能和机器学习取得了进展，但Coq语法和语义的专业性给大型语言模型（LLMs）带来了独特挑战。为弥合这一差距，我们提出了一个专门用于提升LLMs解释和生成Coq代码能力的综合数据集。该数据集源自超过10,000个Coq源文件，包含大量命题、证明和定义，并附有包括来源引用和许可信息在内的元数据。我们的主要目标是促进能够生成语法正确且语义有意义的Coq结构的LLMs的开发，从而推进自动化定理证明的前沿。初步实验表明该数据集具有巨大潜力：基于此数据训练的模型在Coq代码生成方面展现出更高的准确性。值得注意的是，一项特定实验显示，经过微调的LLM能够为一个基本引理生成141个有效证明，凸显了该数据集在促进多样化有效证明策略发现方面的实用性。本文讨论了数据集的构成、创建方法以及我们的发现对形式化验证中机器学习未来发展的启示。该数据集可供进一步研究和探索：https://huggingface.co/datasets/florath/coq-facts-props-proofs-gen0-v1

相关内容

数据集

关注 88

数据集，又称为资料集、数据集合或资料集合，是一种由数据所组成的集合。
Data set（或dataset）是一个数据的集合，通常以表格形式出现。每一列代表一个特定变量。每一行都对应于某一成员的数据集的问题。它列出的价值观为每一个变量，如身高和体重的一个物体或价值的随机数。每个数值被称为数据资料。对应于行数，该数据集的数据可能包括一个或多个成员。

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

UCM《机器学习导论笔记》，80页pdf CSE176 Introduction to Machine Learning

专知会员服务

32+阅读 · 2021年9月29日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日