Code smells are symptoms of potential code quality problems that may harm software maintainability, thereby increasing development costs and impacting software reliability. Large language models (LLMs) have shown remarkable capabilities in supporting various software engineering activities, but their use for detecting code smells remains underexplored. Unlike the rigid rules of static analysis tools, LLMs can support flexible, adaptable detection strategies tailored to the unique properties of each code smell. This paper evaluates the effectiveness of four LLMs -- DeepSeek-R1, GPT-5 mini, Llama-3.3, and Qwen2.5-Code -- at detecting nine code smells across 30 Java projects. For the empirical evaluation, we built a ground-truth dataset by asking 76 developers to manually inspect 268 code-smell candidates. Our results indicate that LLMs perform strongly on structurally straightforward smells, such as Large Class and Long Method. However, we also observed that different LLMs and tools fare better on distinct code smells. We then propose and evaluate a detection strategy that combines LLMs with static analysis tools. The combined strategy outperforms both standalone LLMs and tools on five of the nine code smells in terms of F1-score, but it also generates more false positives for complex smells. We therefore conclude that the optimal strategy depends on whether recall or precision is the main priority for code smell detection.