LLM-based agents are becoming central to software engineering tasks, yet evaluating them remains fragmented and largely model-centric. Existing studies overlook how architectural components, such as planners, memory, and tool routers, shape agent behavior, limiting diagnostic power. We propose a lightweight, architecture-informed approach that links agent components to their observable behaviors and to the metrics capable of evaluating them. Our method clarifies what to measure and why, and we illustrate its application through real-world agents, enabling more targeted, transparent, and actionable evaluation of LLM-based agents.