Large language models (LLMs) are increasingly being adopted in the software engineering domain, yet the robustness of their grasp of core software design concepts remains unclear. We conduct an empirical study that systematically evaluates their understanding of cohesion (intra-module) and coupling (inter-module). We programmatically generate poorly designed code fragments and test the DeepSeek-R1 model family ($14$B, $32$B, $70$B) under varying levels of guidance, from simple \textit{Verification} to \textit{Guided} and \textit{Open-ended Generation}, while varying contextual noise by injecting distractor elements. Although the models exhibit a solid baseline understanding of both concepts under ideal conditions, their practical knowledge is fragile and highly asymmetrical. Reasoning about coupling proves brittle: performance collapses in noisy, open-ended scenarios, with F1 scores dropping by over $50\%$. In contrast, the models' analysis of cohesion is remarkably robust to internal noise in guided tasks, showing little performance degradation; this resilience, however, also fails once all guidance is removed. Reasoning-trace analysis confirms these failure modes, revealing \textit{cognitive shortcutting} for coupling versus a more exhaustive (yet still failing) analysis for cohesion. In summary, while LLMs can provide reliable assistance in recognizing design flaws, their ability to reason autonomously in noisy, realistic contexts is limited, highlighting the critical need for more scalable and robust program understanding capabilities.