Sentence embedding models play a key role in various Natural Language Processing tasks, such as topic modeling, document clustering, and recommendation systems. However, these models rely heavily on parallel data, which can be scarce for many low-resource languages, including Luxembourgish. This scarcity results in suboptimal performance of monolingual and cross-lingual sentence embedding models for these languages. To address this issue, we compile a relatively small but high-quality human-generated cross-lingual parallel dataset and use it to train LuxEmbedder, an enhanced sentence embedding model for Luxembourgish with strong cross-lingual capabilities. Additionally, we present evidence suggesting that including low-resource languages in parallel training datasets can benefit other low-resource languages more than relying solely on high-resource language pairs. Furthermore, recognizing the lack of sentence embedding benchmarks for low-resource languages, we create a paraphrase detection benchmark specifically for Luxembourgish, aiming to partially fill this gap and promote further research.