Improper parsing of attacker-controlled input is a leading source of software security vulnerabilities, especially when programmers transcribe informal format descriptions in RFCs into efficient parsing logic in low-level, memory unsafe languages. Several researchers have proposed formal specification languages for data formats from which efficient code can be extracted. However, distilling informal requirements into formal specifications is challenging and, despite their benefits, new, formal languages are hard for people to learn and use. In this work, we present 3DGen, a framework that makes use of AI agents to transform mixed informal input, including natural language documents (i.e., RFCs) and example inputs into format specifications in a language called 3D. To support humans in understanding and trusting the generated specifications, 3DGen uses symbolic methods to also synthesize test inputs that can be validated against an external oracle. Symbolic test generation also helps in distinguishing multiple plausible solutions. Through a process of repeated refinement, 3DGen produces a 3D specification that conforms to a test suite, and which yields safe, efficient, provably correct, parsing code in C. We have evaluated 3DGen on 20 Internet standard formats, demonstrating the potential for AI-agents to produce formally verified C code at a non-trivial scale. A key enabler is the use of a domain-specific language to limit AI outputs to a class for which automated, symbolic analysis is tractable.
翻译:攻击者控制输入的错误解析是软件安全漏洞的主要来源,特别是当程序员将RFC中的非正式格式描述转录为低级、非内存安全语言的高效解析逻辑时。已有研究者提出用于数据格式的形式化规范语言,能够从中提取高效代码。然而,将非正式需求提炼为形式化规范颇具挑战性,且新形式化语言虽有其优势,但学习与使用门槛较高。本文提出3DGen框架,利用AI智能体将混合的非正式输入(包括自然语言文档即RFC和示例输入)转化为名为3D的格式规范。为帮助用户理解并信任生成的规范,3DGen采用符号化方法合成可经由外部预言机验证的测试输入。符号测试生成还有助于区分多个合理解决方案。通过反复精炼过程,3DGen生成符合测试套件的3D规范,并产出安全、高效、可证明正确的C语言解析代码。我们在20种互联网标准格式上评估了3DGen,展示了AI智能体在非平凡规模下生成形式化验证C代码的潜力。关键使能技术在于使用领域特定语言,将AI输出约束在可通过自动化符号分析处理的类别中。