Previous work has shown that the representations output by contextual language models are more anisotropic than static type embeddings and typically display outlier dimensions. This seems to hold for both monolingual and multilingual models, although much less work has been done in the multilingual setting. Why these outliers occur and how they affect the representations remain active areas of research. We investigate outlier dimensions and their relationship to anisotropy in multiple pre-trained multilingual language models. We focus on cross-lingual semantic similarity tasks, as these are natural tasks for evaluating multilingual representations, and specifically examine sentence representations. Sentence transformers that are fine-tuned on parallel resources, which are not always available, perform better on this task, and we show that their representations are more isotropic. Our aim, however, is to improve multilingual representations in general. We therefore investigate how much of the performance difference can be made up by transforming the embedding space alone, without fine-tuning, and visualise the resulting spaces. We test three operations: removing individual outlier dimensions, cluster-based isotropy enhancement, and ZCA whitening. We publish our code for reproducibility.
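To make two of the operations named above concrete, the following is a minimal NumPy sketch, not the released code. It assumes sentence embeddings are stored as rows of a matrix, uses average pairwise cosine similarity as a rough anisotropy estimate, and flags outlier dimensions with a three-standard-deviation rule on the per-dimension means. The function names, the threshold, the `eps` regulariser, and the synthetic data are all illustrative assumptions; cluster-based isotropy enhancement is not covered.

```python
import numpy as np


def mean_cosine_similarity(X):
    """Average cosine similarity over all distinct row pairs (rough anisotropy estimate)."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = X.shape[0]
    # Subtract the diagonal (self-similarity) before averaging.
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))


def zero_outlier_dims(X, n_std=3.0):
    """Zero dimensions whose mean is more than n_std standard deviations away
    from the average per-dimension mean (an assumed criterion)."""
    dim_means = X.mean(axis=0)
    z = np.abs(dim_means - dim_means.mean()) / dim_means.std()
    outliers = np.where(z > n_std)[0]
    X_clean = X.copy()
    X_clean[:, outliers] = 0.0
    return X_clean, outliers


def zca_whiten(X, eps=1e-5):
    """ZCA whitening: centre the data, then apply W = U diag(1/sqrt(lambda + eps)) U^T."""
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # cov is symmetric
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W


# Synthetic stand-in for sentence embeddings: 500 vectors, 16 dimensions,
# with one artificially shifted dimension playing the role of an outlier.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))
X[:, 3] += 20.0

X_clean, outliers = zero_outlier_dims(X)
X_white = zca_whiten(X)

print("anisotropy (raw):          ", mean_cosine_similarity(X))
print("anisotropy (outlier zeroed):", mean_cosine_similarity(X_clean))
print("anisotropy (ZCA whitened):  ", mean_cosine_similarity(X_white))
print("flagged dimensions:", outliers)
```

On this toy data the shifted dimension dominates every vector, so the raw average cosine similarity is close to 1. Both transformations bring it close to 0 without any fine-tuning. Whether the same holds for real multilingual sentence embeddings is the empirical question the abstract raises.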