Sinhala-English Word Embedding Alignment: Introducing Datasets and Benchmark for a Low Resource Language

Since their inception, embeddings have become a primary ingredient in many flavours of Natural Language Processing (NLP) tasks, supplanting earlier types of representation. Although multilingual embeddings are used for a growing number of multilingual tasks, work on low-resource languages such as Sinhala tends to focus on monolingual embeddings because parallel training data is scarce. These monolingual embeddings are hard to use in multilingual tasks: even if the embedding spaces have a similar geometric arrangement owing to an identical training process, the embeddings of the languages considered are not aligned. The embedding alignment task addresses this problem. Even here, high-resource language pairs are in the limelight, while low-resource languages such as Sinhala, which are in dire need of help, seem to have fallen by the wayside. In this paper, we attempt to align Sinhala and English word embedding spaces using available alignment techniques and introduce a benchmark for Sinhala embedding alignment. To facilitate supervised alignment, we also introduce, as an intermediate task, Sinhala-English alignment datasets, which serve as the anchor datasets for supervised word embedding alignment. Although our results are not comparable to those for high-resource languages such as French, German, or Chinese, we believe our work lays the groundwork for more specialized alignment between English and Sinhala embeddings.
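
To make the supervised alignment idea concrete, the following is a minimal sketch of orthogonal Procrustes alignment, one widely used supervised technique. The abstract does not name the specific techniques the paper evaluates, so the choice of method is an assumption, as are the function name `procrustes_align` and the synthetic data standing in for anchor word pairs. Row i of `src` and row i of `tgt` play the role of the embeddings of one anchor pair.

```python
import numpy as np


def procrustes_align(src, tgt):
    """Return the orthogonal matrix W that minimizes ||src @ W - tgt||_F.

    src, tgt: arrays of shape (n_pairs, dim); row i of each holds the
    embeddings of one anchor (dictionary) word pair.
    """
    # Closed-form solution: W = U V^T, where U S V^T is the SVD of src^T tgt.
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


# Synthetic stand-in for anchor pairs (hypothetical data, not from the paper):
# the "source" space is the "target" space under a hidden rotation.
rng = np.random.default_rng(0)
dim, n_pairs = 5, 50
tgt = rng.normal(size=(n_pairs, dim))
hidden_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
src = tgt @ hidden_rotation.T

W = procrustes_align(src, tgt)
aligned = src @ W  # source embeddings mapped into the target space
```

In this noise-free toy setting the learned map recovers the hidden rotation exactly. With real monolingual embeddings the two spaces are only approximately related by a rotation, so the alignment is approximate and is usually judged by how well it retrieves translations of held-out words.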

Related content

Word Vector Representations

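A distributed representation encodes language as dense, low-dimensional, continuous vectors. Researchers observed early on that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. Such embedding methods can be trained directly on large-scale unlabeled corpora. The quality of word embeddings also depends heavily on the choice of context window size: larger context windows typically yield embeddings that reflect topical information, while smaller context windows yield embeddings that reflect a word's function and contextual semantics.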
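
The analogy relation can be illustrated with a small sketch. The three-dimensional vectors below are hand-made toy values chosen so that the offsets line up exactly; they are not trained embeddings, and the helper names `cosine` and `analogy` are hypothetical.

```python
import numpy as np

# Toy embeddings (hypothetical values): axis 1 encodes gender, axis 2 royalty.
emb = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([0.0, 0.0, 0.2]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(a, b, c):
    """Solve a : b :: c : ? as the nearest cosine neighbor of b - a + c."""
    query = emb[b] - emb[a] + emb[c]
    # Exclude the three query words, as is common in analogy evaluation.
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], query))


result = analogy("man", "woman", "king")
```

Here `result` is `"queen"` because woman − man + king equals the queen vector by construction. Trained embeddings satisfy such relations only approximately.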
