Words have been represented in high-dimensional vector spaces that encode their semantic similarities, enabling downstream applications such as retrieving synonyms, antonyms, and relevant contexts. However, despite recent advances in multilingual language models (LMs), the effectiveness of these models' representations in semantic retrieval contexts has not been comprehensively explored. To fill this gap, this paper introduces MINERS, a benchmark designed to evaluate the ability of multilingual LMs on semantic retrieval tasks, including bitext mining and classification via retrieval-augmented contexts. We create a comprehensive framework to assess the robustness of LMs in retrieving samples across more than 200 diverse languages, including extremely low-resource languages, in challenging cross-lingual and code-switching settings. Our results demonstrate that solely retrieving semantically similar embeddings yields performance competitive with state-of-the-art approaches, without requiring any fine-tuning.
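The retrieval idea the abstract describes, finding the samples whose embeddings are most semantically similar to a query, can be sketched as nearest-neighbor search under cosine similarity. This is a minimal illustrative sketch, not the MINERS implementation: the function name, the toy three-dimensional vectors, and the choice of cosine similarity are assumptions for demonstration only.

```python
import numpy as np

def retrieve_top_k(query, corpus, k=2):
    """Return indices and scores of the k corpus rows most similar to query."""
    # Normalize all embeddings so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = c @ q
    top = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    return top, sims[top]

# Toy "embeddings": each row stands in for one corpus sample's vector.
corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
])
query = np.array([1.0, 0.05, 0.0])

idx, scores = retrieve_top_k(query, corpus, k=2)
print(idx.tolist())  # → [0, 1]: the two rows closest to the query direction
```

In a retrieval-augmented classification setting, the labels of the retrieved neighbors (or the neighbors themselves, as in-context examples) would then be used to predict the query's label, with no model fine-tuning involved.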