System prompts for LLM-based coding agents are software artifacts that govern agent behavior, yet they lack the testing infrastructure applied to conventional software. We present Arbiter, a framework that combines formal evaluation rules with multi-model LLM scouring to detect interference patterns in system prompts. Applied to the system prompts of three major coding agents, Claude Code (Anthropic), Codex CLI (OpenAI), and Gemini CLI (Google), Arbiter yields 152 findings in the undirected scouring phase and 21 hand-labeled interference patterns in a directed analysis of one vendor. We show that prompt architecture (monolithic, flat, modular) correlates strongly with the observed failure class but not with severity, and that multi-model evaluation discovers categorically different vulnerability classes than single-model analysis. One scourer finding, structural data loss in Gemini CLI's memory system, was consistent with an issue filed and patched by Google; the patch addressed the symptom without addressing the schema-level root cause identified by the scourer. The total cost of the cross-vendor analysis was \$0.27 USD.