Compositionality, the notion that the meaning of an expression is built from the meanings of its parts according to syntactic rules, permits the infinite productivity of human language. For the first time, artificial language models (LMs) are able to match human performance on a number of compositional generalization tasks. However, much remains to be understood about the representational mechanisms underlying these abilities. We take a high-level geometric approach to this problem by relating the degree of compositionality in a dataset to the intrinsic dimensionality of its representations under an LM, a measure of feature complexity. We find not only that the degree of dataset compositionality is reflected in representations' intrinsic dimensionality, but that the relationship between compositionality and geometric complexity arises from linguistic features learned over training. Finally, our analyses reveal a striking contrast between linear and nonlinear dimensionality, showing that they respectively encode formal and semantic aspects of linguistic composition.
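To make the contrast between the two geometric measures concrete, here is a minimal sketch of how nonlinear intrinsic dimensionality (via the standard TwoNN estimator of Facco et al.) and linear dimensionality (via a PCA variance threshold) can each be computed from a matrix of representations. The function names and the 99% variance cutoff are illustrative assumptions, not the paper's exact protocol:

```python
import numpy as np

def two_nn_id(X):
    # TwoNN intrinsic-dimension estimator: uses the ratio of each point's
    # 2nd- to 1st-nearest-neighbor distance; sensitive to curved manifolds.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)          # exclude self-distances
    sorted_d = np.sort(D, axis=1)
    mu = sorted_d[:, 1] / sorted_d[:, 0]  # r2 / r1 for each point
    return len(X) / np.sum(np.log(mu))    # maximum-likelihood estimate

def linear_dim(X, var=0.99):
    # Linear dimensionality: number of principal components needed to
    # capture `var` of the total variance (counts flat directions only).
    Xc = X - X.mean(axis=0)
    s2 = np.linalg.svd(Xc, compute_uv=False) ** 2
    return int(np.searchsorted(np.cumsum(s2) / s2.sum(), var) + 1)
```

On data lying on a low-dimensional linear subspace the two measures agree; when the underlying manifold is curved, TwoNN can report a much lower dimension than the PCA count, which is the kind of gap the linear/nonlinear contrast above exploits.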