LLM-based agents are rapidly being adopted across diverse domains. Since they interact with users without supervision, they must be tested extensively. Current testing approaches focus on acceptance-level evaluation from the user's perspective. While intuitive, these tests require manual evaluation, are difficult to automate, do not facilitate root-cause analysis, and require expensive test environments. In this paper, we present methods to enable structural testing of LLM-based agents. Our approach utilizes traces (based on OpenTelemetry) to capture agent trajectories, employs mocking to enforce reproducible LLM behavior, and adds assertions to automate test verification. This enables testing agent components and interactions at a deeper technical level within automated workflows. We demonstrate how structural testing enables the adaptation of software engineering best practices to agents, including the test automation pyramid, regression testing, test-driven development, and multi-language testing. In representative case studies, we show automated test execution and faster root-cause analysis. Collectively, these methods reduce testing costs and improve agent quality through higher coverage, reusability, and earlier defect detection. We provide an open source reference implementation on GitHub.
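The core idea of the abstract — mock the LLM for reproducibility, record the agent's trajectory as a trace, and verify it with assertions — can be illustrated with a minimal sketch. The agent loop, the `complete` method, and the plain-list trace below are hypothetical stand-ins, not the paper's actual API or its OpenTelemetry integration:

```python
# Minimal sketch of structural agent testing. The agent loop and the
# LLM client's `complete` method are hypothetical illustrations.
from unittest.mock import Mock

def run_agent(llm, query, trace):
    """Toy agent step: call the LLM once, recording each event as a trace span."""
    trace.append(("llm_call", query))       # span: outgoing LLM request
    reply = llm.complete(query)
    trace.append(("llm_reply", reply))      # span: LLM response received
    return reply

# Mock the LLM to enforce reproducible behavior (no live model needed).
llm = Mock()
llm.complete.return_value = "Paris"

trace = []                                  # stand-in for a list of trace spans
answer = run_agent(llm, "Capital of France?", trace)

# Assertions automate verification at the structural (trace) level,
# rather than relying on manual acceptance-level judgment.
assert answer == "Paris"
assert trace[0] == ("llm_call", "Capital of France?")
llm.complete.assert_called_once()
```

Because the mocked LLM is deterministic, this test can run unchanged in CI, and a failing assertion points directly at the trace step where behavior diverged.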