ProofAgent Harness: Open Infrastructure for Adversarial Evaluation of AI Agents

AI agents are entering high-risk production settings, where they use tools, retain context, follow policies, handle private data, and interact with users over multiple turns. Yet many evaluation methods still judge isolated outputs or static tasks, and so miss failures that emerge only over a trajectory, under pressure, or through adversarial interaction. We introduce ProofAgent Harness, open infrastructure for scalable, auditable, and adversarial evaluation of AI agents. The harness wraps an agent in evaluation infrastructure: it curates evaluation intelligence, runs adversarial multi-turn trials, captures behavioral traces, applies post-hoc multi-juror scoring, resolves disagreement, and produces evidence-linked reports. Its open design allows developers and researchers to extend domains, traps, metrics, juror personas, scoring rules, and reporting formats. At its core is Adversarial Multi-Juror Scoring with Turn-Level Audit, which evaluates completed agent behavior under pressure using calibrated juror personas, consensus checks, and turn-level evidence. Experiments across customer support, medical triage, privacy and security, and code generation agents show that even strong agents fail selectively: on specific weak metrics, at fragile turns, under unsafe reframing, and along manipulation paths. We also find that a small, quantized, local Harness LLM can challenge production agents powered by large, best-in-class LLMs, suggesting that evaluation capability emerges from the full harness pipeline rather than from model scale alone. ProofAgent Harness turns AI agent evaluation from a static score into scalable adversarial evaluation infrastructure that is repeatable, evidence-backed, extensible, and actionable before deployment.
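
To make the idea of post-hoc multi-juror scoring with a turn-level audit concrete, the following is a minimal sketch and not the ProofAgent Harness implementation. Every name in it (`JurorScore`, `audit_turns`, the juror persona labels, the 0.0 to 1.0 score scale, the median as the consensus rule, and the `max_spread` disagreement threshold) is an assumption chosen for illustration; the abstract does not specify how jurors are calibrated or how disagreement is actually resolved.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class JurorScore:
    juror: str    # hypothetical juror persona name
    turn: int     # index of the turn in the completed transcript
    score: float  # assumed scale: 0.0 (fail) to 1.0 (pass)


def audit_turns(scores, max_spread=0.3):
    """Group juror scores by turn and flag turns where jurors disagree.

    Consensus is taken as the median score (an assumed rule). A turn is
    marked disputed when the gap between the highest and lowest juror
    score exceeds max_spread (an assumed threshold).
    """
    by_turn = {}
    for s in scores:
        by_turn.setdefault(s.turn, []).append(s.score)

    report = {}
    for turn, values in sorted(by_turn.items()):
        spread = max(values) - min(values)
        report[turn] = {
            "consensus": median(values),
            "spread": spread,
            "disputed": spread > max_spread,
        }
    return report


# Illustrative, made-up scores for a two-turn transcript and three jurors.
scores = [
    JurorScore("safety", 0, 0.9),
    JurorScore("policy", 0, 0.8),
    JurorScore("skeptic", 0, 0.9),
    JurorScore("safety", 1, 0.2),
    JurorScore("policy", 1, 0.9),
    JurorScore("skeptic", 1, 0.4),
]
report = audit_turns(scores)
for turn, entry in report.items():
    print(turn, entry)
```

In this sketch, a disputed turn would be the natural place to attach transcript evidence and trigger a disagreement-resolution step, which is one plausible reading of how turn-level evidence and consensus checks fit together.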
