Transformer language models (LMs) have been shown to represent concepts as directions in the latent space of hidden activations. However, for any given human-interpretable concept, how can we find its direction in the latent space? We present a technique called linear relational concepts (LRC) that finds the direction corresponding to a human-interpretable concept by first modeling the relation between subject and object as a linear relational embedding (LRE). We find that inverting the LRE while using earlier object layers yields a powerful technique for finding concept directions, one that outperforms standard black-box probing classifiers. We evaluate LRCs both on their performance as concept classifiers and on their ability to causally change model output.
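To make the idea of inverting an LRE concrete, here is a minimal sketch on synthetic data. It is an illustration under stated assumptions, not the procedure evaluated in this work: the LRE is taken to have the affine form o ≈ W s + b and is fit by ordinary least squares, the inversion is a truncated-SVD pseudo-inverse, and the activations are random toy vectors rather than hidden states read from a real LM. The function names (`fit_lre`, `low_rank_pinv`, `concept_direction`, `demo`) and every numeric setting are hypothetical, and the choice of subject and object layers is not modeled at all.

```python
import numpy as np


def fit_lre(S, O):
    """Fit O ≈ S @ W.T + b by least squares (a stand-in for estimating an LRE)."""
    S_aug = np.hstack([S, np.ones((S.shape[0], 1))])  # extra column for the bias
    coef, *_ = np.linalg.lstsq(S_aug, O, rcond=None)
    W = coef[:-1].T  # shape (d_obj, d_subj)
    b = coef[-1]     # shape (d_obj,)
    return W, b


def low_rank_pinv(W, rank):
    """Pseudo-inverse of W restricted to its top `rank` singular directions."""
    U, sigma, Vt = np.linalg.svd(W, full_matrices=False)
    return (Vt[:rank].T / sigma[:rank]) @ U[:, :rank].T


def concept_direction(W, b, o_concept, rank):
    """Map an object-side activation back to a unit vector in subject space."""
    v = low_rank_pinv(W, rank) @ (o_concept - b)
    return v / np.linalg.norm(v)


def demo(seed=0, d=8, n_per_class=100):
    rng = np.random.default_rng(seed)
    # Hypothetical ground truth: a well-conditioned linear relation.
    W_true = np.eye(d) + 0.1 * rng.normal(size=(d, d))
    b_true = 0.5 * rng.normal(size=d)
    # Two toy concepts whose subject activations cluster around separate means.
    means = 3.0 * np.eye(d)[:2]
    labels = np.repeat([0, 1], n_per_class)

    def sample_subjects():
        return means[labels] + 0.3 * rng.normal(size=(labels.size, d))

    S_train = sample_subjects()
    O_train = S_train @ W_true.T + b_true + 0.01 * rng.normal(size=S_train.shape)

    W, b = fit_lre(S_train, O_train)
    # One direction per concept, from the mean object activation of that concept.
    directions = np.stack([
        concept_direction(W, b, O_train[labels == c].mean(axis=0), rank=d)
        for c in (0, 1)
    ])
    # Classify fresh subject activations by their largest projection.
    S_test = sample_subjects()
    predictions = (S_test @ directions.T).argmax(axis=1)
    return (predictions == labels).mean()


if __name__ == "__main__":
    print(f"toy concept-classification accuracy: {demo():.2f}")
```

The `rank` argument is kept explicit because a truncated inverse discards the weakest singular directions of W; the demo uses the full rank only to keep the toy problem simple, and a lower value would be a hyperparameter to tune in practice.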