A goal of interpretability is to recover disentangled representations of latent concepts (features) from the activations of neural networks. The quality of features is typically evaluated in isolation and under implicit independence assumptions that may not hold in practice. It is therefore unclear to what extent common featurization methods, such as sparse autoencoders (SAEs) and probes, disentangle one concept from another. We propose a multi-concept evaluation setting with concepts including sentiment, domain, voice, and tense. We first evaluate how well featurizers produce disentangled representations of each concept, observing that features are typically sensitive to only one concept but that each concept is distributed across many features. We then steer these features, measuring whether each concept is independently manipulable and whether features interact. Even in idealized settings, steering a feature often affects many concepts, despite a near absence of interaction effects. These results suggest that correlational metrics are insufficient to establish steering selectivity, and that showing that two features operate in separate spaces is insufficient to claim that each will be selective for a single concept. Together, they underscore the importance of multi-concept evaluations in interpretability research.
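The distinction between off-target steering effects and interaction effects can be illustrated with a toy sketch. This is not the paper's experimental setup: the two-dimensional activation space, the probe directions in `PROBES`, the activation `x`, and the helper names `steer`, `steering_effects`, and `interaction` are all hypothetical, and the readouts are assumed to be purely linear. Under those assumptions, adding a multiple of one concept's direction shifts every readout that is not orthogonal to it, while the combined effect of two steering vectors is exactly the sum of their individual effects.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def steer(x, direction, alpha):
    """Additive steering: move activation x by alpha along direction."""
    return [xi + alpha * di for xi, di in zip(x, direction)]


def steering_effects(x, steer_dir, probes, alpha):
    """Change in each concept's linear readout after steering along steer_dir."""
    steered = steer(x, steer_dir, alpha)
    return {name: dot(steered, p) - dot(x, p) for name, p in probes.items()}


def interaction(x, dir_a, dir_b, probe, alpha):
    """Difference-in-differences: joint effect minus the two individual effects."""
    both = dot(steer(steer(x, dir_a, alpha), dir_b, alpha), probe)
    only_a = dot(steer(x, dir_a, alpha), probe)
    only_b = dot(steer(x, dir_b, alpha), probe)
    return both - only_a - only_b + dot(x, probe)


# Hypothetical unit-norm concept directions with cosine similarity 0.6.
PROBES = {"sentiment": [1.0, 0.0], "tense": [0.6, 0.8]}
x = [0.5, -0.3]  # an arbitrary toy activation

effects = steering_effects(x, PROBES["sentiment"], PROBES, alpha=2.0)
inter = interaction(x, PROBES["sentiment"], PROBES["tense"], PROBES["tense"], alpha=2.0)
print(effects)
print(inter)
```

In this toy case, steering along the sentiment direction with `alpha=2.0` shifts the sentiment readout by 2.0 and the tense readout by about 1.2 (`alpha` times the cosine similarity), while the interaction term is zero up to floating-point rounding. A near absence of interaction therefore does not by itself imply that steering is selective.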