LLM-integrated software, which embeds or interacts with large language models (LLMs) as functional components, exhibits probabilistic and context-dependent behavior that differs fundamentally from that of traditional software. This shift introduces a new category of integration defects that arise not only from code errors but also from misaligned interactions among LLM-specific artifacts, including prompts, API calls, configurations, and model outputs. However, existing defect localization techniques are ineffective at identifying these LLM-specific integration defects because they fail to capture cross-layer dependencies across heterogeneous artifacts, cannot make use of incomplete or misleading error traces, and lack the semantic reasoning needed to identify root causes. To address these challenges, we propose LIDL, a multi-agent framework for defect localization in LLM-integrated software. LIDL (1) constructs a code knowledge graph enriched with LLM-aware annotations that represent interaction boundaries across source code, prompts, and configuration files, (2) fuses three complementary sources of error evidence inferred by LLMs to surface candidate defect locations, and (3) applies context-aware validation that uses counterfactual reasoning to distinguish true root causes from propagated symptoms. We evaluate LIDL on 146 real-world defect instances collected from 105 GitHub repositories and 16 agent-based systems. The results show that LIDL significantly outperforms five state-of-the-art baselines across all metrics, achieving a Top-3 accuracy of 0.64 and a mean average precision (MAP) of 0.48, which represents a 64.1% improvement over the best-performing baseline. Notably, LIDL achieves these gains while reducing cost by 92.5%, demonstrating both high accuracy and cost efficiency.
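The two headline metrics, Top-3 accuracy and MAP, can be illustrated with a minimal sketch. It assumes the standard information-retrieval definitions, which the abstract does not spell out: each defect instance yields a ranked list of candidate locations without duplicates, plus a set of ground-truth defect locations. The function names and the file names in the example are hypothetical and are not taken from LIDL.

```python
from typing import List, Sequence, Set, Tuple


def top_k_hit(ranked: Sequence[str], truth: Set[str], k: int) -> bool:
    """True if any ground-truth location appears among the first k candidates."""
    return any(loc in truth for loc in ranked[:k])


def average_precision(ranked: Sequence[str], truth: Set[str]) -> float:
    """Average of precision@rank over the ranks where a true location occurs.

    Assumes `ranked` has no duplicate entries. True locations that never
    appear in the ranking contribute zero, because we divide by len(truth).
    """
    if not truth:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, loc in enumerate(ranked, start=1):
        if loc in truth:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(truth)


def evaluate(results: List[Tuple[Sequence[str], Set[str]]], k: int = 3) -> Tuple[float, float]:
    """Return (Top-k accuracy, MAP) over all defect instances."""
    n = len(results)
    top_k = sum(top_k_hit(r, t, k) for r, t in results) / n
    map_score = sum(average_precision(r, t) for r, t in results) / n
    return top_k, map_score


# Hypothetical example: two defect instances with one true location each.
example = [
    (["a.py", "b.py", "c.py"], {"b.py"}),          # true location at rank 2
    (["x.py", "y.py", "z.py", "w.py"], {"w.py"}),  # true location at rank 4
]
top3, map_score = evaluate(example, k=3)
```

In this example only the first instance has its true location within the top three, so Top-3 accuracy is 0.5, and MAP is the mean of 1/2 and 1/4, that is, 0.375. Under these definitions, a Top-3 accuracy of 0.64 would mean that a true defect location appears among the first three candidates for 64% of instances.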