Code smells are symptoms of potential code quality problems that may hinder software maintainability, thereby increasing development costs and reducing software reliability. Large language models (LLMs) have shown remarkable capabilities in supporting various software engineering activities, but their use for detecting code smells remains underexplored. Unlike the rigid rules of static analysis tools, LLMs can support flexible and adaptable detection strategies tailored to the unique properties of each code smell. This paper evaluates the effectiveness of four LLMs (DeepSeek-R1, GPT-5 mini, Llama-3.3, and Qwen2.5-Code) in detecting nine code smells across 30 Java projects. For the empirical evaluation, we created a ground-truth dataset by asking 76 developers to manually inspect 268 code-smell candidates. Our results indicate that LLMs perform strongly on structurally straightforward smells, such as Large Class and Long Method. We also observed that different LLMs and static analysis tools perform best on different code smells. Motivated by this, we propose and evaluate a detection strategy that combines LLMs and static analysis tools. The proposed strategy outperforms both standalone LLMs and standalone tools on five of the nine code smells in terms of F1-score. However, it also generates more false positives for complex smells. We therefore conclude that the optimal strategy depends on whether Recall or Precision is the main priority for code smell detection.
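The trade-off between Recall and Precision described above can be illustrated with a minimal sketch. The abstract does not specify how the combined strategy merges LLM and tool outputs; the example below assumes, purely for illustration, a simple union of flagged candidates. The function name `precision_recall_f1`, the candidate labels, and all data are hypothetical and are not taken from the study.

```python
def precision_recall_f1(predicted: set, actual: set) -> tuple:
    """Return (precision, recall, F1) for a set of flagged candidates."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data: candidates confirmed as smelly by developers.
ground_truth = {"A", "B", "C", "D"}

# Hypothetical detector outputs (not results from the paper).
llm_flags = {"A", "B", "E"}
tool_flags = {"C", "F"}

# Assumed combination rule: flag a candidate if either detector flags it.
combined_flags = llm_flags | tool_flags

for name, flags in [("LLM", llm_flags), ("Tool", tool_flags), ("Combined", combined_flags)]:
    p, r, f1 = precision_recall_f1(flags, ground_truth)
    false_positives = len(flags - ground_truth)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f} FP={false_positives}")
```

In this toy setting the union finds more of the true smells than either detector alone, but it also inherits the false positives of both, which is the kind of tension the abstract attributes to the combined strategy on complex smells.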