Concept Activation Vectors (CAVs) are a tool from explainable AI, offering a promising approach to understanding how human-understandable concepts are encoded in a model's latent spaces. They are computed from the hidden-layer activations of inputs drawn either from a concept class or from a set of non-concept examples. Adopting a probabilistic perspective, the distribution of the (non-)concept inputs induces a distribution over the CAV, making it a random vector in the latent space. This enables us to derive the mean and covariance for different types of CAVs, leading to a unified theoretical view. The probabilistic perspective also reveals a potential vulnerability: CAVs can depend strongly on the rather arbitrary non-concept distribution, a factor largely overlooked in prior work. We illustrate this with a simple yet effective adversarial attack, underscoring the need for a more systematic study.
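The probabilistic view described above can be illustrated with a minimal NumPy sketch: a difference-of-means CAV (one common variant; others fit a linear classifier and take its normal vector) is recomputed under resampled non-concept distributions, yielding an empirical mean and covariance for the CAV. All data and shapes here are synthetic assumptions for illustration, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def cav(concept_acts, nonconcept_acts):
    """Difference-of-means CAV: a simple estimate of a concept direction
    from hidden-layer activations of concept vs. non-concept inputs."""
    v = concept_acts.mean(axis=0) - nonconcept_acts.mean(axis=0)
    return v / np.linalg.norm(v)

# Toy latent space: concept activations cluster at a fixed offset.
d = 16
concept = rng.normal(loc=1.0, scale=0.5, size=(200, d))

# Resampling the (rather arbitrary) non-concept distribution induces a
# distribution over the CAV -- it becomes a random vector in latent space.
cavs = np.stack([
    cav(concept,
        rng.normal(loc=rng.normal(0.0, 1.0, d), scale=0.5, size=(200, d)))
    for _ in range(500)
])

mean_cav = cavs.mean(axis=0)           # empirical mean of the CAV
cov_cav = np.cov(cavs, rowvar=False)   # empirical covariance of the CAV
print(mean_cav.shape, cov_cav.shape)
```

The spread encoded in `cov_cav` quantifies how sensitive the estimated concept direction is to the choice of non-concept examples, which is exactly the dependence the abstract flags as a vulnerability.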