AI Agent Systems: Architectures, Applications, and Evaluation

AI agents (systems that combine foundation models with reasoning, planning, memory, and tool use) are rapidly becoming a practical interface between natural-language intent and real-world computation. This survey synthesizes the emerging landscape of AI agent architectures across: (i) deliberation and reasoning (e.g., chain-of-thought-style decomposition, self-reflection and verification, and constraint-aware decision making), (ii) planning and control (from reactive policies to hierarchical and multi-step planners), and (iii) tool calling and environment interaction (retrieval, code execution, APIs, and multimodal perception). We organize prior work into a unified taxonomy spanning agent components (policy/LLM core, memory, world models, planners, tool routers, and critics), orchestration patterns (single-agent vs. multi-agent; centralized vs. decentralized coordination), and deployment settings (offline analysis vs. online interactive assistance; safety-critical vs. open-ended tasks). We discuss key design trade-offs (latency vs. accuracy, autonomy vs. controllability, and capability vs. reliability) and highlight how evaluation is complicated by non-determinism, long-horizon credit assignment, tool and environment variability, and hidden costs such as retries and context growth. Finally, we summarize measurement and benchmarking practices (task suites, human preference and utility metrics, success under constraints, robustness and security) and identify open challenges including verification and guardrails for tool actions, scalable memory and context management, interpretability of agent decisions, and reproducible evaluation under realistic workloads.
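As a purely illustrative aid, the component taxonomy above (policy core, memory, tool router, critic) can be sketched as a minimal agent loop. This is a sketch under stated assumptions and not an implementation from any surveyed system: every name in it (`Agent`, `toy_policy`, `toy_critic`, `calc`, `lookup`) is hypothetical, and the policy is a hand-written stub standing in for an LLM call. The loop also counts one of the hidden costs mentioned above (retries) and lets memory grow with each step, mimicking context growth.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# All names here are hypothetical and exist only for illustration.
Tool = Callable[[str], str]


@dataclass
class Agent:
    policy: Callable[[str, List[str]], Tuple[str, str]]  # (goal, memory) -> (tool name, argument)
    tools: Dict[str, Tool]                                # tool router: name -> callable
    critic: Callable[[str, str], bool]                    # (goal, observation) -> accept result?
    memory: List[str] = field(default_factory=list)       # grows every step (context growth)
    retries: int = 0                                      # hidden cost: failed tool calls

    def run(self, goal: str, max_steps: int = 5) -> Optional[str]:
        for _ in range(max_steps):
            name, arg = self.policy(goal, self.memory)
            tool = self.tools.get(name)
            if tool is None:
                self.retries += 1
                self.memory.append(f"error: unknown tool {name!r}")
                continue
            try:
                obs = tool(arg)
            except Exception as exc:
                # Record the failure so the policy can react on the next step.
                self.retries += 1
                self.memory.append(f"error: {name} failed: {exc}")
                continue
            self.memory.append(f"{name}({arg}) -> {obs}")
            if self.critic(goal, obs):
                return obs
        return None  # step budget exhausted without an accepted result


def calc(expr: str) -> str:
    # Toy calculator: supports only "a*b" with integer operands.
    left, right = expr.split("*")
    return str(int(left) * int(right))


def lookup(key: str) -> str:
    # Toy retrieval tool: raises KeyError for unknown keys.
    return {"pi": "3.14159"}[key]


def toy_policy(goal: str, memory: List[str]) -> Tuple[str, str]:
    # Stub standing in for an LLM: tries retrieval first, then the calculator.
    if not memory:
        return "lookup", goal
    return "calc", goal


def toy_critic(goal: str, obs: str) -> bool:
    # Accepts any observation that is a non-negative integer string.
    return obs.isdigit()


agent = Agent(policy=toy_policy, tools={"calc": calc, "lookup": lookup}, critic=toy_critic)
result = agent.run("6*7")
print(result, agent.retries, len(agent.memory))  # prints: 42 1 2
```

In this toy run the first tool choice fails, so the task succeeds only after one retry and two memory entries. A success-only metric would not show that overhead, which is why evaluation may need to report such costs alongside task success.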
