When using LLMs to address Non-Functional Requirements (NFRs), developers may behave differently (e.g., expressing the same NFR in different words). Robust LLMs should output consistent results across these variations; however, this aspect remains underexplored. We propose RobuNFR to evaluate the robustness of LLMs in NFR-aware code generation across four NFR dimensions (design, readability, reliability, and performance) using three methodologies: prompt variation, regression testing, and diverse workflows. Our experiments show that RobuNFR reveals robustness issues in the tested LLMs when NFRs are considered in code generation. Specifically, under prompt variation, including NFRs decreases Pass@1 by up to 39% and increases its standard deviation from 0.48 to 2.48 relative to the baseline without NFRs (i.e., Function-Only). While incorporating NFRs generally improves overall NFR metrics, it also increases prompt sensitivity. In regression settings, some LLMs exhibit differences across versions, with improvement in one aspect (e.g., reduced code smells) often accompanied by regression in another (e.g., decreased correctness), revealing inconsistencies that challenge their robustness. When workflows vary, the tested LLMs show significantly different NFR-aware code generation capabilities between two workflows: (1) integrating NFRs and functional requirements into the initial prompt and (2) refining Function-Only-generated code with the same NFRs.
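To make the prompt-variation setting concrete, the sketch below is a minimal Python illustration, not the paper's implementation: the paraphrase list and the caller-supplied `pass_at_1` evaluator are hypothetical stand-ins. It shows how a Pass@1 mean and standard deviation could be aggregated across rewordings of the same NFR appended to a functional prompt.

```python
import statistics
from typing import Callable

# Hypothetical paraphrases of one readability NFR; the actual RobuNFR
# prompt set is not reproduced here.
NFR_PARAPHRASES = [
    "The code should be easy to read.",
    "Write readable code.",
    "Make sure the implementation is readable.",
]

def prompt_variation_robustness(
    function_prompt: str,
    pass_at_1: Callable[[str], float],
) -> tuple[float, float]:
    """Combine the functional prompt with each NFR paraphrase, score each
    combined prompt with a caller-supplied Pass@1 evaluator (one generation
    per task, fraction of tasks passing all tests), and return the mean and
    standard deviation of Pass@1 across paraphrases."""
    scores = [pass_at_1(f"{function_prompt}\n{nfr}") for nfr in NFR_PARAPHRASES]
    return statistics.mean(scores), statistics.stdev(scores)
```

Under this reading, a low standard deviation across paraphrases would indicate the consistency that the abstract argues current LLMs lack.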