AI agents must be evaluated as behavioral systems, not as isolated response generators. They reason across turns, call tools, preserve context, follow policies, and act under uncertainty. Existing methods provide useful but fragmented signals: benchmarks measure fixed capabilities, Human-in-the-Loop review preserves expert judgment but does not scale easily, LLM-as-judge methods depend on evaluator design, red teaming is often episodic, and trace auditing requires explicit evidence rules. This paper introduces Human-on-the-Bridge (HOB), a scalable evaluation paradigm for agentic AI. HOB places human expertise upstream, where experts curate reusable evaluation intelligence before testing begins, including domain context, Red-Team Traps, Juror Personas, scoring guidelines, audit rules, and fallback policies. ProofAgent Harness then executes this curated intelligence repeatedly through multi-turn adversarial evaluations, trace capture, multi-juror scoring, and evidence-linked reporting. We evaluate HOB through symmetric and cost-efficient asymmetric settings across frontier LLM-based agents and Harness LLM tiers. The study covers 23,500 agent turns and produces evidence-linked findings across finance, healthcare, and code generation. The results show that HOB can amplify evaluation quality without requiring equally large evaluator models, allowing smaller Harness LLMs to challenge agents built on frontier LLM backbones. The evaluation surfaces failures often missed by static benchmarks and single-evaluator scoring, including phantom tool-call claims, missing mandatory tool calls, policy drift, manipulation paths, and safe but non-resolving refusals. These findings support HOB as a paradigm for scaling human-curated evaluation intelligence, where expert judgment is encoded upfront and reused across repeated agent evaluations rather than applied manually inside every run.
翻译:AI智能体必须作为行为系统而非孤立响应生成器进行评估。它们需要跨轮次推理、调用工具、维持上下文、遵循策略并在不确定性下行动。现有方法提供了有价值但碎片化的评估信号:基准测试衡量固定能力,人机协同审核保留了专家判断但难以扩展,LLM作为裁判的方法依赖评估器设计,红队测试往往具有间歇性,而轨迹审核需要明确的证据规则。本文提出Human-on-the-Bridge(HOB),一种面向智能体AI的可扩展评估范式。HOB将人类专业知识置于上游阶段,专家在测试开始前策划可复用的评估智能,包括领域上下文、红队陷阱、陪审员角色画像、评分准则、审计规则及回退策略。ProofAgent测试框架随后通过多轮对抗性评估、轨迹捕获、多陪审员评分及证据关联报告,反复执行这些预置的评估智能。我们通过对称和成本优化的非对称设置,对基于前沿LLM的智能体及测试框架LLM层级进行了评估。研究涵盖23,500个智能体对话轮次,在金融、医疗及代码生成领域产出了证据关联的评估发现。结果表明,HOB无需同等规模的评估模型即可提升评估质量,使得较小规模的测试框架LLM能够挑战基于前沿LLM架构构建的智能体。该评估揭示了静态基准测试和单一评估者评分常遗漏的失败模式,包括虚拟工具调用声明、遗漏必需工具调用、策略漂移、操纵路径以及安全但不解决用户问题的拒绝响应。这些发现支撑了HOB作为扩展人类策划评估智能的范式,其中专家判断被预先编码并在多次智能体评估中复用,而非每次运行中手动应用。