When Agent A delegates to Agent B, which invokes Tool C on behalf of User X, no existing framework can answer two questions: whose authorization chain led to this action, and where did that chain violate policy? This paper introduces SentinelAgent, a formal framework for verifiable delegation chains in federal multi-agent AI systems. The Delegation Chain Calculus (DCC) defines seven properties. Six are deterministic: authority narrowing, policy preservation, forensic reconstructibility, cascade containment, scope-action conformance, and output schema conformance. One is probabilistic: intent preservation (P2). The DCC also provides four meta-theorems and one proposition showing that deterministic intent verification is infeasible in practice. The Intent-Preserving Delegation Protocol (IPDP) enforces all seven properties at runtime through a non-LLM Delegation Authority Service (DAS). A three-point verification lifecycle achieves 100% combined true-positive rate (TPR) at 0% false-positive rate (FPR) on DelegationBench v4 (516 scenarios, 10 attack categories, 13 federal domains). Under black-box adversarial conditions, the DAS blocks 30/30 attacks with 0 false positives. Adversarial stress testing produced no violations of the deterministic properties. In contrast, the TPR of intent verification drops to 13% against sophisticated paraphrasing. Fine-tuning the natural language inference (NLI) model used for intent verification on 190 government delegation examples raises P2 TPR from 1.7% to 88.3% (5-fold cross-validated, F1 = 82.1%). Properties P1 and P3-P7 are mechanically verified via TLA+ model checking across 2.7 million states with zero violations. Even when intent verification is evaded, the remaining six properties confine the adversary to permitted API calls, conformant outputs, traceable actions, bounded cascades, and compliant behavior.
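The opening scenario (User X, Agent A, Agent B, Tool C) can be made concrete with a small sketch of the kind of deterministic, non-LLM checks a delegation authority could run. This is a minimal illustration under stated assumptions, not the IPDP or DAS implementation. The sketch makes four modeling choices:

- Authority is a set of scope strings, so authority narrowing becomes a subset test.
- Cascade containment is a fixed depth bound (a hypothetical `MAX_DEPTH = 3`).
- Scope-action conformance is set membership.
- Forensic reconstructibility is a walk back along parent links.

All names (`DelegationToken`, `delegate`, `authorize_action`, `reconstruct_chain`, `DelegationDenied`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

# Hypothetical sketch: these names are illustrative, not the SentinelAgent/IPDP API.
MAX_DEPTH = 3  # assumed cascade bound: at most 3 delegation hops from the user


@dataclass(frozen=True)
class DelegationToken:
    principal: str                               # holder of this authority (user, agent, or tool)
    scopes: frozenset                            # permitted actions, e.g. {"read:records"}
    depth: int = 0                               # hops from the originating user
    parent: Optional["DelegationToken"] = None   # link to the delegating token


class DelegationDenied(Exception):
    """Raised when a requested delegation violates a deterministic check."""


def delegate(parent: DelegationToken, child: str, requested: Set[str]) -> DelegationToken:
    """Issue a child token only if it passes the deterministic checks below."""
    requested = set(requested)
    # Authority narrowing (assumed form): a child may receive only a subset of the parent's scopes.
    if not requested <= parent.scopes:
        raise DelegationDenied(f"scope escalation: {sorted(requested - parent.scopes)}")
    # Cascade containment (assumed form): delegation depth is bounded.
    if parent.depth + 1 > MAX_DEPTH:
        raise DelegationDenied("delegation depth exceeded")
    return DelegationToken(child, frozenset(requested), parent.depth + 1, parent)


def authorize_action(token: DelegationToken, action: str) -> bool:
    # Scope-action conformance (assumed form): the action must be within the token's scopes.
    return action in token.scopes


def reconstruct_chain(token: DelegationToken) -> List[str]:
    # Forensic reconstructibility (assumed form): follow parent links back to the user.
    chain = []
    node: Optional[DelegationToken] = token
    while node is not None:
        chain.append(node.principal)
        node = node.parent
    return list(reversed(chain))


if __name__ == "__main__":
    user = DelegationToken("user:X", frozenset({"read:records", "write:records"}))
    a = delegate(user, "agent:A", {"read:records", "write:records"})
    b = delegate(a, "agent:B", {"read:records"})
    print(authorize_action(b, "read:records"))   # True
    print(authorize_action(b, "write:records"))  # False
    print(reconstruct_chain(b))                  # ['user:X', 'agent:A', 'agent:B']
    try:
        delegate(b, "tool:C", {"write:records"})
    except DelegationDenied as err:
        print(err)                               # scope escalation: ['write:records']
```

Every check here is a set or integer comparison, so the same inputs always give the same verdict. That reproducibility is the contrast the abstract draws between the deterministic properties and probabilistic intent verification. The sketch does not model intent preservation at all.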