Modern language models can process inputs across diverse languages and modalities. We hypothesize that models acquire this capability by learning a shared representation space across heterogeneous data types (e.g., different languages and modalities), which places semantically similar inputs near one another even when they come from different modalities or languages. We term this the semantic hub hypothesis, after the hub-and-spoke model from neuroscience (Patterson et al., 2007), which posits that semantic knowledge in the human brain is organized through a transmodal semantic "hub" that integrates information from various modality-specific "spoke" regions. We first show that model representations of semantically equivalent inputs in different languages are similar in the intermediate layers, and that this space can be interpreted through the model's dominant pretraining language via the logit lens. This tendency extends to other data types, including arithmetic expressions, code, and visual/audio inputs. Interventions in the shared representation space for one data type also predictably affect model outputs for other data types, suggesting that this shared representation space is not simply a vestigial byproduct of large-scale training on broad data, but is actively used by the model during input processing.
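The logit lens mentioned above decodes an intermediate hidden state directly into vocabulary logits by applying the model's final normalization and unembedding matrix early, rather than only at the last layer. The following is a minimal sketch with toy, randomly initialized weights; the dimensions, RMS normalization, and matrix names are illustrative assumptions, not the paper's actual models.

```python
import numpy as np

# Toy stand-ins for a transformer's components (illustrative only;
# a real analysis would use a pretrained model's norm and unembedding).
rng = np.random.default_rng(0)
d_model, vocab_size = 16, 50                   # toy dimensions
W_U = rng.normal(size=(d_model, vocab_size))   # unembedding matrix

def rms_norm(h, eps=1e-6):
    """RMS normalization, as used by e.g. Llama-family models."""
    return h / np.sqrt(np.mean(h ** 2) + eps)

def logit_lens(hidden_state):
    """Project an intermediate hidden state into vocabulary logits
    by applying the final norm and unembedding ahead of schedule."""
    return rms_norm(hidden_state) @ W_U

h_mid = rng.normal(size=d_model)   # stand-in intermediate activation
logits = logit_lens(h_mid)         # one logit per vocabulary token
top_token_id = int(np.argmax(logits))
```

Inspecting which tokens receive high logits at intermediate layers is how one can read off, for instance, that a multilingual model's hub space is interpretable in its dominant pretraining language.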