Large Language Models (LLMs) have revolutionized intelligent application development. While standalone LLMs cannot perform actions on their own, LLM agents address this limitation by integrating external tools. However, debugging LLM agents is difficult and costly, as the field is still in its early stages and the community is underdeveloped. To understand the bugs encountered during agent development, we present the first comprehensive study of bug types, root causes, and effects in LLM agent-based software. We collected and analyzed 1,187 bug-related posts and code snippets from Stack Overflow, GitHub, and the Hugging Face forums, focusing on LLM agents built with seven widely used LLM frameworks as well as custom implementations. For deeper analysis, we also studied the component in which each bug occurred, along with the programming language and framework involved. This study further investigates the feasibility of automating bug identification. To this end, we built a ReAct agent named BugReAct, equipped with appropriate external tools, to determine whether it can detect and annotate the bugs in our dataset. We found that BugReAct, when equipped with Gemini 2.5 Flash, achieved strong performance in annotating bug characteristics at an average cost of 0.01 USD per post/code snippet.