Recently, large language models (LLMs) have been extensively utilized to enhance development efficiency, leading to numerous benchmarks for evaluating their performance. However, these benchmarks predominantly focus on implementation, overlooking the equally critical aspect of software design. This gap raises two pivotal questions: (1) Can LLMs handle software design? (2) Can LLMs write code that follows a given design? To investigate these questions, this paper proposes DesBench, a design-aware benchmark for evaluating LLMs on three software design-related tasks: design-aware code generation, object-oriented modeling, and acceptance test case design. DesBench comprises 30 manually crafted Java projects that include requirement documents, design models, implementations, and acceptance tests, amounting to 30 design models, 194 Java classes, and 737 test cases in total. Using DesBench, we evaluated seven state-of-the-art LLMs, including three DeepSeek R1, two Qwen2.5, and two GPT models. The results reveal that LLMs remain significantly challenged by the intricacies of software design: (1) For code generation, LLMs struggle to produce correct implementations when provided with only high-level designs or no design at all. (2) In object-oriented modeling, while LLMs can accurately identify objects and classes, they face challenges in defining operations and inter-class relationships. (3) Acceptance test cases generated by LLMs from functional requirements achieve code coverage comparable to those written by humans. Our research highlights the current limitations of LLMs in handling software design and calls for further investigation into new design methodologies and languages suitable for LLM-based development.