Verified Detection and Prevention of Concurrency Anomalies in Multi-Agent Large Language Model Systems

from arxiv, 32 pages, 2 figures, 6 tables. Verus/TLA+ verification artifact, reference Rust runtime, and Python harnesses, plus a supplementary appendix (Sections A-F, Tables S1-S6), included as ancillary files

Multi-agent LLM systems share state through memory stores, vector indices, and tool registries. We model such sharing as long-running read-generate-write operations under deterministic-generation semantics -- the regime durable-execution engines enforce by deterministic replay -- and formalize four concurrency anomalies in TLA+: stale-generation, phantom-tool, causal-cascade, and tool-effect reordering, structural analogues of classical isolation anomalies, each with a TLC counter-example. The exclusion lattice over these anomalies is trivial; the contribution is the mechanically verified realizability and strict separation of one maximal chain within it, $L_0 \subsetneq \cdots \subsetneq L_4$, to our knowledge the first machine-checked consistency hierarchy for such runtimes. A development of 274 Verus obligations (zero assume, zero admit; trust base: two structural axioms and a mutex correspondence) proves the detectors sound and complete against the specifications and each runtime its avoidance set. Three deployed Rust runtimes realize L0-L1 (pessimistic locking, serializable snapshot isolation, default-SI), each verified against stale-generation and refined to its state machine; L2-L4 are exec-mode-verified with dependency-free prevention twins (A3, A6, A2: 0/1000 versus 1000/1000), and L2 is run live across three model families (A3 prevented in all 120 retracted sessions). We reproduce a silent lost update in ByteDance's deer-flow, formalizing its fix as a verified $L_0 \to L_1$ refinement, and exhibit tool-effect reordering in LangGraph's ToolNode on unmodified output, removed by an L3 commit-order sequencer. The verified detector, refinements, and realizability artifacts are the contribution; the phenomena and lattice are classical.

翻译：多智能体大语言模型系统通过内存存储、向量索引和工具注册表共享状态。我们将这种共享建模为在确定性生成语义下的长期读-生成-写操作——持久化执行引擎通过确定性重放强制执行该机制——并在TLA+中形式化了四种并发异常：陈旧生成、幻影工具、因果级联和工具效应重排序，这些是经典隔离异常的结构类比，每种均提供了TLC反例。这些异常上的排除格是平凡的；贡献在于机械验证了其中一条最大链$L_0 \subsetneq \cdots \subsetneq L_4$的可实现性与严格分离——据我们所知，这是此类运行时首个经机器检查的一致性层次结构。通过274个Verus义务（零假设、零认可；信任基础：两条结构公理和一个互斥对应关系）的开发，我们证明了检测器相对于规范的可靠性和完备性，以及每个运行时其规避集。三个已部署的Rust运行时实现了L0-L1（悲观锁、可序列化快照隔离、默认SI），每个均针对陈旧生成进行了验证并精化为其状态机；L2-L4通过执行模式验证并附带无依赖的预防孪生技术（A3、A6、A2：0/1000对比1000/1000），L2在三个模型家族中实时运行（所有120个回滚会话中A3均被预防）。我们重现了字节跳动deer-flow中的静默丢失更新，将其修复形式化为已验证的$L_0 \to L_1$精化，并在LangGraph的ToolNode中对未修改输出展示了工具效应重排序（该问题已通过L3提交顺序排序器消除）。已验证的检测器、精化方案和可实现性工件是贡献所在；相关现象和格结构则是经典内容。