Improper parsing of attacker-controlled input is a leading source of software security vulnerabilities, especially when programmers transcribe informal format descriptions in RFCs into efficient parsing logic in low-level, memory unsafe languages. Several researchers have proposed formal specification languages for data formats from which efficient code can be extracted. However, distilling informal requirements into formal specifications is challenging and, despite their benefits, new, formal languages are hard for people to learn and use. In this work, we present 3DGen, a framework that makes use of AI agents to transform mixed informal input, including natural language documents (i.e., RFCs) and example inputs into format specifications in a language called 3D. To support humans in understanding and trusting the generated specifications, 3DGen uses symbolic methods to also synthesize test inputs that can be validated against an external oracle. Symbolic test generation also helps in distinguishing multiple plausible solutions. Through a process of repeated refinement, 3DGen produces a 3D specification that conforms to a test suite, and which yields safe, efficient, provably correct, parsing code in C. We have evaluated 3DGen on 20 Internet standard formats, demonstrating the potential for AI-agents to produce formally verified C code at a non-trivial scale. A key enabler is the use of a domain-specific language to limit AI outputs to a class for which automated, symbolic analysis is tractable.
翻译:摘要:对攻击者控制输入的不当解析是软件安全漏洞的主要来源,尤其是在程序员将RFC中的非正式格式描述转译为低层级、内存不安全语言的高效解析逻辑时。已有研究者提出从数据格式形式化规约语言中提取高效代码的方法。然而,将非正式需求提炼为形式化规约极具挑战性,且尽管新形式化语言具有优势,人们仍难以学习和使用。本研究提出3DGen框架,该框架利用AI智能体将混合非正式输入(包括自然语言文档(如RFC)和示例输入)转化为名为3D语言的格式规约。为帮助人类理解并信任生成的规约,3DGen采用符号方法综合生成可经由外部预言机验证的测试输入。符号测试生成还有助于区分多种可行解。通过反复精炼过程,3DGen生成符合测试套件的3D规约,并据此产生安全、高效、可证明正确的C语言解析代码。我们在20个互联网标准格式上评估了3DGen,展示了AI智能体以非平凡规模生成经过形式化验证的C代码的潜力。其关键使能技术在于使用领域特定语言将AI输出限定在可通过自动化符号分析处理的类别范围内。