Although multilingual embeddings capture linguistic nuances across diverse languages remarkably well, it remains unclear how closely languages are aligned within them. Drawing inspiration from research on high-dimensional representations in neural language models, we employ clustering to uncover latent concepts within multilingual models. Our analysis quantifies the \textit{alignment} and \textit{overlap} of these concepts across languages in the latent space. To this end, we introduce two metrics, \CA{} and \CO{}, which measure these properties and enable a deeper exploration of multilingual embeddings. Our study covers three multilingual models (\texttt{mT5}, \texttt{mBERT}, and \texttt{XLM-R}) and three downstream tasks (Machine Translation, Named Entity Recognition, and Sentiment Analysis). Our key findings are: i) deeper layers of the network exhibit greater cross-lingual \textit{alignment} due to the presence of language-agnostic concepts, ii) fine-tuning the models enhances \textit{alignment} within the latent space, and iii) such task-specific calibration helps explain the emergence of zero-shot capabilities in the models.\footnote{The code is available at \url{https://github.com/baselmousi/multilingual-latent-concepts}}