Deployability-Centric Infrastructure-as-Code Generation: Fail, Learn, Refine, and Succeed through LLM-Empowered DevOps Simulation

Infrastructure-as-Code (IaC) generation holds significant promise for automating cloud infrastructure provisioning. Recent advances in Large Language Models (LLMs) offer an opportunity to democratize IaC development by generating deployable infrastructure templates from natural language descriptions. However, current evaluation focuses on syntactic correctness while ignoring deployability, the critical measure of the utility of IaC configuration files. Six state-of-the-art LLMs perform poorly on deployability, achieving only a 20.8$\sim$30.2% deployment success rate on the first attempt. In this paper, we construct DPIaC-Eval, the first deployability-centric IaC template benchmark, consisting of 153 real-world scenarios across 58 unique services. We also propose IaCGen, an LLM-based deployability-centric framework that uses an iterative feedback mechanism encompassing format verification, syntax checking, and live deployment stages, thereby closely mirroring real DevOps workflows. Results show that IaCGen makes 54.6$\sim$91.6% of the generated IaC templates from all evaluated models deployable within the first 10 iterations. Additionally, human-in-the-loop feedback that provides direct guidance on deployability errors further boosts performance to over 90% passItr@25 on all evaluated LLMs. Furthermore, we explore the trustworthiness of the generated IaC templates in terms of user intent alignment and security compliance. The poor performance (25.2% user requirement coverage and 8.4% security compliance rate) indicates a critical need for continued research in this domain.

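The iterative feedback mechanism described in the abstract (format verification, then syntax checking, then live deployment, with errors fed back to the model) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `refine_until_deployable`, `pass_itr_at_k`, `Outcome`, and the toy generator and checks are all hypothetical, and the reading of passItr@k as "the fraction of scenarios that become deployable within k iterations" is an assumption based on how the metric is used above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# A check returns None on success, or an error message to feed back to the model.
Check = Callable[[str], Optional[str]]


@dataclass
class Outcome:
    deployed: bool
    iterations: int  # number of generation attempts used
    template: str


def refine_until_deployable(
    generate: Callable[[str, Optional[str]], str],
    checks: Sequence[Tuple[str, Check]],
    prompt: str,
    max_iterations: int = 10,
) -> Outcome:
    """Generate a template, run the staged checks in order, and retry with
    the first failing stage's error as feedback (hypothetical sketch)."""
    feedback: Optional[str] = None
    template = ""
    for attempt in range(1, max_iterations + 1):
        template = generate(prompt, feedback)
        feedback = None
        for stage, check in checks:
            error = check(template)
            if error is not None:
                # Stop at the first failing stage; later stages are skipped.
                feedback = f"[{stage}] {error}"
                break
        if feedback is None:
            return Outcome(True, attempt, template)
    return Outcome(False, max_iterations, template)


def pass_itr_at_k(outcomes: Sequence[Outcome], k: int) -> float:
    """Assumed metric: share of scenarios deployed within k iterations."""
    if not outcomes:
        return 0.0
    passed = sum(1 for o in outcomes if o.deployed and o.iterations <= k)
    return passed / len(outcomes)


# Toy stand-ins (assumptions): a real setup would call an LLM and a cloud API.
def toy_generate(prompt: str, feedback: Optional[str]) -> str:
    # Pretends to "fix" the template once it has seen any feedback.
    return "Resources: {}" if feedback else "Resource {}"


toy_checks = [
    ("format", lambda t: None if t.strip() else "empty template"),
    ("syntax", lambda t: None if t.startswith("Resources:")
     else "missing top-level Resources key"),
    ("deploy", lambda t: None),  # placeholder for a live deployment attempt
]

result = refine_until_deployable(toy_generate, toy_checks, "create a bucket")
```

In this toy run the first template fails the syntax stage and the second passes every stage, so the scenario counts toward passItr@2 but not passItr@1 under the assumed definition. Human-in-the-loop feedback would fit the same loop by replacing the automatically generated `feedback` string with guidance written by a person.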