Multilingual language models achieve strong aggregate performance yet often behave unpredictably across languages, scripts, and cultures. We argue that mechanistic explanations for such models should satisfy a \emph{causal} standard: claims must survive causal interventions and must \emph{cross-reference} across environments that perturb surface form while preserving meaning. We formalize \emph{reference families} as predicate-preserving variants and introduce \emph{triangulation}, an acceptance rule requiring necessity (ablating the circuit degrades the target behavior), sufficiency (patching activations transfers the behavior), and invariance (both effects remain directionally stable and of sufficient magnitude across the reference family). To supply candidate subgraphs, we adopt automatic circuit discovery and \emph{accept or reject} those candidates by triangulation. We ground triangulation in causal abstraction by casting it as an approximate transformation score over a distribution of interchange interventions, connect it to the pragmatic interpretability agenda, and present a comparative experimental protocol across multiple model families, language pairs, and tasks. Triangulation provides a falsifiable standard for mechanistic claims that filters spurious circuits passing single-environment tests but failing cross-lingual invariance.
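As a rough illustration of the acceptance rule described in the abstract, the following minimal sketch checks necessity, sufficiency, and invariance over a reference family. It is not the paper's implementation: the `Effects` container, the `triangulate` function, the sign convention (positive values mean the effect goes in the expected direction), and the thresholds `tau_nec` and `tau_suf` are all hypothetical names and values introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Effects:
    """Measured effects for one environment (one variant in a reference family).

    Assumed convention: both values are positive when the effect goes in the
    expected direction.
    """
    necessity: float    # drop in the target metric when the circuit is ablated
    sufficiency: float  # gain in the target metric when activations are patched in


def triangulate(
    family: Sequence[Effects],
    tau_nec: float = 0.1,  # hypothetical minimum necessity effect
    tau_suf: float = 0.1,  # hypothetical minimum sufficiency effect
) -> bool:
    """Accept a candidate circuit only if both effects clear their thresholds
    in every environment of the reference family.

    With positive thresholds, this also enforces directional stability:
    an effect that flips sign in any environment causes rejection.
    """
    if not family:
        return False  # no evidence, no acceptance
    return all(
        e.necessity >= tau_nec and e.sufficiency >= tau_suf
        for e in family
    )


# A circuit whose effects hold across all variants is accepted.
stable = [Effects(0.40, 0.30), Effects(0.35, 0.25), Effects(0.50, 0.20)]
# A circuit that passes in one environment but fails in another is rejected.
spurious = [Effects(0.40, 0.30), Effects(-0.05, 0.02)]

print(triangulate(stable))
print(triangulate(spurious))
```

The second case corresponds to the kind of candidate the abstract calls spurious: it would pass a single-environment test on its first variant but fails once the rest of the reference family is considered.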