Enhancing Formal Theorem Proving: A Comprehensive Dataset for Training AI Models on Coq Code

In the realm of formal theorem proving, the Coq proof assistant stands out for its rigorous approach to verifying mathematical assertions and software correctness. Despite the advances in artificial intelligence and machine learning, the specialized nature of Coq syntax and semantics poses unique challenges for Large Language Models (LLMs). Addressing this gap, we present a comprehensive dataset specifically designed to enhance LLMs' proficiency in interpreting and generating Coq code. This dataset, derived from a collection of over 10,000 Coq source files, encompasses a wide array of propositions, proofs, and definitions, enriched with metadata including source references and licensing information. Our primary aim is to facilitate the development of LLMs capable of generating syntactically correct and semantically meaningful Coq constructs, thereby advancing the frontier of automated theorem proving. Initial experiments with this dataset have showcased its significant potential; models trained on this data exhibited enhanced accuracy in Coq code generation. Notably, a particular experiment revealed that a fine-tuned LLM was capable of generating 141 valid proofs for a basic lemma, highlighting the dataset's utility in facilitating the discovery of diverse and valid proof strategies. This paper discusses the dataset's composition, the methodology behind its creation, and the implications of our findings for the future of machine learning in formal verification. The dataset is accessible for further research and exploration: https://huggingface.co/datasets/florath/coq-facts-props-proofs-gen0-v1

翻译：在形式化定理证明领域，Coq证明助手因其对数学断言和软件正确性的严格验证方法而独树一帜。尽管人工智能与机器学习已取得显著进展，但Coq语法与语义的专业性对大型语言模型(LLMs)构成了独特挑战。为弥补这一不足，我们提出一个专门用于提升LLMs在Coq代码解释与生成方面能力的综合数据集。该数据集源自超过10,000个Coq源文件，涵盖广泛的命题、证明与定义，并附有来源引用及许可证信息等元数据。我们的主要目标是通过该数据集推动LLMs生成语法正确且语义有效的Coq结构，从而拓展自动化定理证明的前沿。初步实验表明，该数据集具有显著潜力；基于该数据集训练的模型在Coq代码生成准确率上表现更优。值得注意的是，一项特定实验发现，经过微调的LLM能够为基础引理生成141个有效证明，这凸显了该数据集在促进多样化有效证明策略探索方面的价值。本文讨论了数据集的构成、创建方法及其发现对形式验证领域机器学习发展的启示。该数据集可公开获取以供进一步研究与探索：https://huggingface.co/datasets/florath/coq-facts-props-proofs-gen0-v1

相关内容

数据集

关注 88

数据集，又称为资料集、数据集合或资料集合，是一种由数据所组成的集合。
Data set（或dataset）是一个数据的集合，通常以表格形式出现。每一列代表一个特定变量。每一行都对应于某一成员的数据集的问题。它列出的价值观为每一个变量，如身高和体重的一个物体或价值的随机数。每个数值被称为数据资料。对应于行数，该数据集的数据可能包括一个或多个成员。

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

UCM《机器学习导论笔记》，80页pdf CSE176 Introduction to Machine Learning

专知会员服务

32+阅读 · 2021年9月29日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日