LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings

Sentence embedding models play a key role in various Natural Language Processing tasks, such as in Topic Modeling, Document Clustering and Recommendation Systems. However, these models rely heavily on parallel data, which can be scarce for many low-resource languages, including Luxembourgish. This scarcity results in suboptimal performance of monolingual and cross-lingual sentence embedding models for these languages. To address this issue, we compile a relatively small but high-quality human-generated cross-lingual parallel dataset to train LuxEmbedder, an enhanced sentence embedding model for Luxembourgish with strong cross-lingual capabilities. Additionally, we present evidence suggesting that including low-resource languages in parallel training datasets can be more advantageous for other low-resource languages than relying solely on high-resource language pairs. Furthermore, recognizing the lack of sentence embedding benchmarks for low-resource languages, we create a paraphrase detection benchmark specifically for Luxembourgish, aiming to partially fill this gap and promote further research.

翻译：句子嵌入模型在多种自然语言处理任务中发挥着关键作用，例如主题建模、文档聚类和推荐系统。然而，这些模型严重依赖于平行数据，而对于包括卢森堡语在内的许多低资源语言而言，此类数据往往稀缺。这种稀缺性导致针对这些语言的单语和跨语言句子嵌入模型性能欠佳。为解决这一问题，我们构建了一个相对较小但高质量的人工生成跨语言平行数据集，用以训练LuxEmbedder——一个具有强大跨语言能力的增强型卢森堡语句子嵌入模型。此外，我们提供的证据表明，在平行训练数据集中纳入低资源语言，相较于仅依赖高资源语言对，可能对其他低资源语言更为有益。进一步地，鉴于低资源语言缺乏句子嵌入基准测试集，我们专门为卢森堡语创建了一个复述检测基准，旨在部分填补这一空白并推动相关研究的深入。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】基于元内存传输的跨域少镜头语义分割，Remember the Difference: Cross-Domain Few-Shot Semantic Segmentation via Meta-Memory Transfer

专知会员服务

13+阅读 · 2022年3月12日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日