Large language models (LLMs) have become indispensable for automated code generation, yet the quality and security of their outputs remain a critical concern. Existing studies concentrate predominantly on adversarial attacks or on inherent flaws within the models. A more prevalent yet underexplored question is how the quality of a benign but poorly formulated prompt affects the security of the generated code. To investigate this, we first propose an evaluation framework for prompt quality that covers three key dimensions: goal clarity, information completeness, and logical consistency. Based on this framework, we construct and publicly release CWE-BENCH-PYTHON, a large-scale benchmark dataset whose task prompts are categorized into four levels of normativity (L0-L3). Extensive experiments on multiple state-of-the-art LLMs reveal a clear correlation: as prompt normativity decreases, the likelihood of generating insecure code increases consistently and markedly. Furthermore, we demonstrate that advanced prompting techniques, such as Chain-of-Thought and Self-Correction, effectively mitigate the security risks introduced by low-quality prompts and substantially improve code safety. Our findings highlight that improving the quality of user prompts is a critical and effective strategy for strengthening the security of AI-generated code.