The advent of instruction-tuned Large Language Models designed for coding tasks (Code LLMs) has transformed software engineering practices. However, their robustness against various input challenges remains a critical concern. This study introduces DegradePrompter, a novel method for systematically evaluating the robustness of instruction-tuned Code LLMs. We assess the impact of diverse input challenges on the functional correctness of generated code using rigorous metrics and established benchmarks. Our comprehensive evaluation covers five state-of-the-art open-source models and three production-grade closed-source models, revealing varying degrees of robustness. Open-source models prove more susceptible to input perturbations, with declines in functional correctness ranging from 12% to 34%, whereas commercial models show relatively greater resilience, degrading by 3% to 24%. To guard against these vulnerabilities, we investigate a straightforward yet effective mitigation strategy. Our findings highlight the need for robust defense mechanisms and comprehensive evaluations during both development and deployment to ensure the resilience and reliability of automated code generation systems.