Transformer language models (LMs) have been shown to represent concepts as directions in the latent space of hidden activations. However, for any given human-interpretable concept, how can we find its direction in the latent space? We present a technique called linear relational concepts (LRC), which finds the direction corresponding to a human-interpretable concept at a given hidden layer in a transformer LM by first modeling the relation between subject and object as a linear relational embedding (LRE). While the LRE work was presented mainly as an exercise in understanding model representations, we find that inverting the LRE while using earlier object layers yields a powerful technique for finding concept directions that both work well as classifiers and causally influence model outputs.
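To make the idea of "inverting the LRE" concrete, the following is a minimal sketch on synthetic data, not the authors' implementation. It assumes the LRE takes the affine form o ≈ W s + b, where s is a subject activation and o is an object activation. The map is fitted here by ordinary least squares purely for illustration, which is not necessarily how the LRE is estimated in practice. The function names (`fit_linear_relation`, `concept_direction`, `classify`, `demo`), the dimensions, and the noise levels are all hypothetical. No real LM activations or layer choices are involved, so the "earlier object layers" aspect is not modeled.

```python
import numpy as np


def fit_linear_relation(S, O):
    """Least-squares fit of O ≈ S @ W.T + b (an illustrative stand-in for an LRE)."""
    X = np.hstack([S, np.ones((S.shape[0], 1))])   # append a bias column
    coef, *_ = np.linalg.lstsq(X, O, rcond=None)   # shape (d_s + 1, d_o)
    W = coef[:-1].T                                # shape (d_o, d_s)
    b = coef[-1]                                   # shape (d_o,)
    return W, b


def concept_direction(W, b, o_concept, rcond=1e-6):
    """Map an object activation back through the pseudo-inverse of W
    and return a unit-norm direction in subject space."""
    v = np.linalg.pinv(W, rcond=rcond) @ (o_concept - b)
    return v / np.linalg.norm(v)


def classify(S, directions):
    """Assign each subject activation to the concept with the largest dot product."""
    return np.argmax(S @ directions.T, axis=1)


def demo(seed=0, d=16, n_per_class=100, n_classes=3):
    rng = np.random.default_rng(seed)

    # Synthetic "ground truth" relation (assumption: well-conditioned affine map).
    W_true = np.eye(d) + 0.1 * rng.standard_normal((d, d))
    b_true = rng.standard_normal(d)

    # Synthetic subject activations clustered around one center per concept.
    centers = 3.0 * np.eye(d)[:n_classes]
    labels = np.repeat(np.arange(n_classes), n_per_class)
    S = centers[labels] + 0.3 * rng.standard_normal((labels.size, d))

    # Synthetic object activations produced by the affine relation plus noise.
    O = S @ W_true.T + b_true + 0.01 * rng.standard_normal((labels.size, d))

    W, b = fit_linear_relation(S, O)

    # One direction per concept, obtained by inverting the fitted map
    # at that concept's mean object activation.
    directions = np.stack([
        concept_direction(W, b, O[labels == c].mean(axis=0))
        for c in range(n_classes)
    ])

    accuracy = float(np.mean(classify(S, directions) == labels))
    return directions, accuracy


if __name__ == "__main__":
    directions, accuracy = demo()
    print(f"directions shape: {directions.shape}, accuracy: {accuracy:.3f}")
```

In this toy setting the recovered directions act as a simple dot-product classifier over subject activations. Whether the same directions causally influence outputs can only be tested on a real model, for example by adding a direction to a hidden state and observing the change in predictions.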