We analyze identifiability as a possible explanation for the ubiquity of linear properties across language models, such as the vector difference between the representations of "easy" and "easiest" being parallel to that between "lucky" and "luckiest". To this end, we ask whether finding a linear property in one model implies that any model inducing the same distribution has that property, too. To answer this, we first prove an identifiability result characterizing distribution-equivalent next-token predictors, lifting a diversity requirement of previous results. Second, based on a refinement of relational linearity [Paccanaro and Hinton, 2001; Hernandez et al., 2024], we show how many notions of linearity are amenable to our analysis. Finally, we show that under suitable conditions, these linear properties hold either in all distribution-equivalent next-token predictors or in none.
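The parallel-difference notion of linearity mentioned above can be illustrated with a minimal sketch. The embeddings below are made-up toy vectors, not representations from any actual language model; the point is only to show what "the difference vectors are parallel" means operationally (cosine similarity of the two difference vectors close to 1):

```python
import numpy as np

# Toy 4-dimensional embeddings (illustrative values only); in practice these
# would be representations extracted from a trained language model.
base_easy = np.array([0.2, 0.5, 0.1, 0.3])
base_lucky = np.array([0.7, 0.1, 0.4, 0.2])
superlative_offset = np.array([0.4, -0.1, 0.2, 0.0])  # shared relation vector

emb = {
    "easy": base_easy,
    "easiest": base_easy + superlative_offset,
    "lucky": base_lucky,
    "luckiest": base_lucky + superlative_offset,
}

def cosine(u, v):
    """Cosine similarity; 1.0 means the vectors are parallel."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# The superlative relation is linear here by construction: both difference
# vectors equal superlative_offset, so their cosine similarity is 1.
d_easy = emb["easiest"] - emb["easy"]
d_lucky = emb["luckiest"] - emb["lucky"]
print(cosine(d_easy, d_lucky))  # → 1.0 (parallel difference vectors)
```

In real models the two difference vectors are only approximately parallel, so such checks typically threshold the cosine similarity rather than test exact equality.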