Machine learning models are trained with relatively simple objectives, such as next-token prediction. Once deployed, however, they appear to capture a more fundamental representation of their input data. Understanding the nature of these representations is of interest because it can help interpret a model's outputs and identify ways to improve the salience of the representations. Concept vectors are constructions that attribute concepts in the input data to directions, represented by vectors, in the model's latent space. In this work, we introduce concept boundary vectors, a concept vector construction derived from the boundary between the latent representations of concepts. Empirically, we demonstrate that concept boundary vectors capture a concept's semantic meaning, and we compare their effectiveness against concept activation vectors.