We present FACADE, a novel probabilistic and geometric framework for unsupervised mechanistic anomaly detection in deep neural networks. Its primary goal is to advance the understanding and mitigation of adversarial attacks. FACADE aims to generate probability distributions over circuits, which provide critical insight into each circuit's contribution to changes in the manifold properties of pseudo-classes (high-dimensional modes in activation space), yielding a powerful tool for uncovering and combating adversarial attacks. Our approach seeks to improve model robustness and enhance scalable model oversight, and it demonstrates promising applications in real-world deployment settings.