Detecting Multi-Agent Collusion Through Multi-Agent Interpretability

As LLM agents are increasingly deployed in multi-agent systems, they introduce risks of covert coordination that may evade standard forms of human oversight. While linear probes on model activations have shown promise for detecting deception in single-agent settings, collusion is inherently a multi-agent phenomenon, and the use of internal representations for detecting collusion between agents remains unexplored. We introduce NARCBench, a benchmark for evaluating collusion detection under environment distribution shift, and propose five probing techniques that aggregate per-agent deception scores to classify scenarios at the group level. Our probes achieve 1.00 AUROC in-distribution and 0.60 to 0.86 AUROC when transferred zero-shot to structurally different multi-agent scenarios and a steganographic blackjack card-counting task. We find that no single probing technique dominates across all collusion types, suggesting that different forms of collusion manifest differently in activation space. We also find preliminary evidence that this signal is localised at the token level, with the colluding agent's activations spiking specifically when processing the encoded parts of their partner's message. This work takes a step toward multi-agent interpretability: extending white-box inspection from single models to multi-agent contexts, where detection requires aggregating signals across agents. These results suggest that model internals provide a complementary signal to text-level monitoring for detecting multi-agent collusion, particularly for organisations with access to model activations. Code and data are available at https://github.com/aaronrose227/narcbench.
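To make the idea of group-level classification from per-agent scores concrete, the following is a minimal sketch, not the implementation in the NARCBench repository. It assumes each agent is summarised by one activation vector, that a linear probe (weights `w`, bias `b`) has already been trained, and that a scenario is scored by aggregating the per-agent probe outputs. The function names and the three aggregators (`max`, `mean`, `spread`) are hypothetical illustrations; the abstract does not name the five techniques the paper proposes, and these should not be read as them.

```python
import numpy as np

def agent_scores(activations, w, b=0.0):
    """Linear probe output per agent.

    activations: array of shape (n_agents, d_model), one vector per agent
                 (an assumed summary, e.g. a mean over tokens).
    w, b:        weights and bias of an already-trained linear probe.
    """
    return np.asarray(activations, dtype=float) @ np.asarray(w, dtype=float) + b

# Hypothetical aggregators for illustration only; these are NOT claimed
# to be the five techniques proposed in the paper.
AGGREGATORS = {
    "max": np.max,                               # most suspicious agent
    "mean": np.mean,                             # average suspicion
    "spread": lambda s: np.max(s) - np.min(s),   # gap between agents
}

def scenario_score(activations, w, b=0.0, how="max"):
    """Collapse per-agent probe scores into one group-level score."""
    return float(AGGREGATORS[how](agent_scores(activations, w, b)))

def auroc(scores, labels):
    """AUROC as the probability that a random positive scenario outranks
    a random negative one, with ties counted as 0.5."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

if __name__ == "__main__":
    # Synthetic data: in "colluding" scenarios, two of four agents are
    # shifted along the probe direction. Purely illustrative.
    rng = np.random.default_rng(0)
    d_model, n_agents = 16, 4
    w = rng.normal(size=d_model)
    scenarios, labels = [], []
    for i in range(200):
        acts = rng.normal(size=(n_agents, d_model))
        colluding = i % 2 == 0
        if colluding:
            acts[:2] += 0.5 * w / np.linalg.norm(w)
        scenarios.append(acts)
        labels.append(colluding)
    for how in AGGREGATORS:
        scores = [scenario_score(a, w, how=how) for a in scenarios]
        print(how, round(auroc(scores, labels), 3))
```

Comparing several aggregators on the same per-agent scores mirrors the observation that no single technique dominates: which summary separates colluding from benign scenarios plausibly depends on whether the signal is concentrated in one agent or shared across several.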
