Large language models (LLMs) show promise for automating software development by translating requirements into code. However, even advanced prompting workflows such as progressive prompting often leave some requirements unmet. Although methods such as self-critique, multi-model collaboration, and retrieval-augmented generation (RAG) have been proposed to address these gaps, developers lack clear guidance on when to use each. In an empirical study of 25 GitHub projects, we found that progressive prompting achieves 96.9% average task completion, significantly outperforming direct prompting (80.5%, Cohen's d=1.63, p<0.001) but still leaving 8 projects incomplete. For the 6 most representative of these projects, we evaluated each enhancement strategy across 4 failure types. Our results reveal that method effectiveness depends critically on failure characteristics: self-critique succeeds on code-reviewable logic errors but fails completely on external service integration (0% improvement), whereas RAG achieves the highest completion rate across all failure types with superior efficiency. Based on these findings, we propose a decision framework that maps each failure pattern to the most suitable enhancement method, giving practitioners practical, data-driven guidance instead of trial-and-error.
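For readers unfamiliar with the effect size reported above, Cohen's d is the difference between two group means divided by their pooled standard deviation; d=1.63 is a very large effect. A minimal sketch of the computation, with hypothetical per-project completion rates (illustrative only, not the study's data):

```python
from statistics import mean, stdev

def cohens_d(a, b):
    # Pooled-standard-deviation form of Cohen's d:
    # d = (mean(a) - mean(b)) / s_pooled
    na, nb = len(a), len(b)
    s_pooled = (((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
                / (na + nb - 2)) ** 0.5
    return (mean(a) - mean(b)) / s_pooled

# Hypothetical completion rates for a few projects (assumed values)
progressive = [0.97, 0.95, 1.00, 0.96, 0.98]
direct = [0.82, 0.78, 0.85, 0.80, 0.77]
print(round(cohens_d(progressive, direct), 2))
```

Values of d above 0.8 are conventionally considered "large", which is why the reported d=1.63 indicates a substantial practical gap between the two prompting workflows, not merely a statistically significant one.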