Every mechanistic circuit carries an invisible asterisk: it reflects not just the model's computation, but the analyst's choice of pruning threshold. Change that choice and the circuit changes, yet current practice treats a single pruned subgraph as ground truth with no way to distinguish robust structure from threshold artifacts. We introduce CIRCUS, which reframes circuit discovery as a problem of uncertainty over explanations. CIRCUS prunes one attribution graph under B configurations, assigns each edge an empirical inclusion frequency s(e) in [0,1] measuring how robustly it survives across the configuration family, and extracts a consensus circuit of edges present in every view. This yields a principled core/contingent/noise decomposition (analogous to posterior model-inclusion indicators in Bayesian variable selection) that separates robust structure from threshold-sensitive artifacts, with negligible overhead. On Gemma-2-2B and Llama-3.2-1B, consensus circuits are 40x smaller than the union of all configurations while retaining comparable influence-flow explanatory power, consistently outperform influence-ranked and random baselines, and are confirmed causally relevant by activation patching.
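As a rough illustration of the procedure described above, the following minimal sketch computes inclusion frequencies and a consensus circuit on a toy graph. It rests on several assumptions that the abstract does not confirm: each pruning configuration is modeled as a single cutoff on absolute edge attribution weight, and the core/contingent/noise split is taken to be s(e) = 1, 0 < s(e) < 1, and s(e) = 0 respectively. The function names, the toy edge weights, and the cutoff values are all hypothetical and are not the CIRCUS implementation.

```python
from collections import Counter


def inclusion_frequencies(edge_weights, thresholds):
    """Return s(e): the fraction of pruning configurations in which edge e survives.

    Assumption: a "configuration" is just a cutoff tau on |weight|; the real
    configuration family may vary other pruning parameters as well.
    """
    counts = Counter()
    for tau in thresholds:
        for edge, weight in edge_weights.items():
            if abs(weight) >= tau:
                counts[edge] += 1
    num_configs = len(thresholds)  # B in the abstract's notation
    return {edge: counts[edge] / num_configs for edge in edge_weights}


def decompose(freqs):
    """Split edges into (core, contingent, noise) by inclusion frequency.

    Assumed cutoffs: core has s(e) == 1 (the consensus circuit), noise has
    s(e) == 0, and contingent is everything in between.
    """
    core = {e for e, s in freqs.items() if s == 1.0}
    noise = {e for e, s in freqs.items() if s == 0.0}
    contingent = set(freqs) - core - noise
    return core, contingent, noise


# Toy attribution graph: (source, target) -> attribution weight (made-up values).
edge_weights = {
    ("a", "b"): 0.9,
    ("b", "c"): 0.4,
    ("a", "c"): 0.05,
    ("c", "d"): 0.7,
}
thresholds = [0.1, 0.3, 0.5, 0.6]  # B = 4 hypothetical pruning configurations

freqs = inclusion_frequencies(edge_weights, thresholds)
core, contingent, noise = decompose(freqs)
```

In this toy example the edges ("a", "b") and ("c", "d") survive every cutoff and form the consensus circuit, ("b", "c") survives only the two loosest cutoffs and is contingent, and ("a", "c") never survives. Because all configurations reuse one attribution graph, the extra cost is a pass over the edges per configuration, which is consistent with the negligible overhead the abstract reports.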