Autonomous agents deployed in regulated domains must produce a verification artifact per consequential output: a record an auditor can re-execute offline, capturing what was claimed, against what source, by whom, when, and how. Production verification today splits into two unstandardized halves. Probabilistic verdict patterns (self-consistency voting, reviewer LLM ensembles) produce judgments, not artifacts. Artifact-producing patterns (RAG, tool-augmented traces, generator-verifier loops) produce vendor-specific records no external auditor can reconstruct without bespoke integration. Pramana defines the missing wire format. Every consequential agent output is wrapped in a typed ClaimAttestation with one of four variants (measurement, inference, analogy, citation), each paired with a verify() operation against the recorded source. verify() is deterministic for MeasurementClaim and CitationClaim. For InferenceClaim and AnalogyClaim, determinism is conditional on the oracle (audit-replayable when LLM-backed). The four-way typology derives from classical Indian epistemology (pramana, valid means of knowledge). The lifecycle is specified in TLA+ and exhaustively verified under TLC across three symmetry-reduced models: 38,563 distinct reachable states, zero invariant violations. The Python reference implementation passes 84 tests. An A2A and MCP wire-extension manifest layers three deployment-grade invariants: reachability, SLA bound, and offline re-verifiability. An exploratory pilot (n=100, 2,275 reviewer calls) probes LLM-as-judge in code generation. The strongest observation is a 40-percentage-point raw FPR delta across corpora, consistent with reference-solution quality contributing significantly. The pilot does not validate Pramana on its own; the structural argument and formal verification do that.
翻译:在受监管领域部署的自主智能体,必须为每个关键输出生成可验证的制品:一份可供审计员离线重放的记录,准确捕获所声明的具体内容、信息来源、声明主体、时间戳及验证方式。当前的生产级验证体系分裂为两个未标准化的分支。概率性裁决模式(自一致性投票、评审者大模型集成)产生的是判决结果而非可验证制品;而制品生成模式(检索增强生成、工具增强轨迹、生成器-验证器循环)产出的是厂商特有记录,外部审计员需通过定制化集成才能重构验证过程。Pramana定义了缺失的传输协议格式:每个关键智能体输出被封装为带类型的ClaimAttestation,包含四种变体(测量、推理、类比、引用),每种变体均配有针对记录源执行的verify()操作。对于MeasurementClaim和CitationClaim,verify()具有确定性;而对于InferenceClaim与AnalogyClaim,确定性取决于预言机(当由大模型驱动时可实现审计重放)。四维分类体系源于古印度知识论(pramana,有效认知手段)。其生命周期以TLA+形式化规约,经TLC模型检测器在三个对称约减模型上全面验证:共发现38,563个可达状态,零不变违反。Python参考实现通过84项测试。A2A与MCP协议扩展清单封装了三个部署级不变量:可达性、服务等级协议边界及离线重验证能力。探索性试点实验(n=100,2,275次评审调用)检验了大模型作为裁判在代码生成中的表现,最显著的观测结果为跨语料库原始假阳性率差值达40个百分点,该差异与参考解质量高度相关。需注意,该试点并非对Pramana自身的验证——其结构性论证与形式化验证方构成核心证据。