Since their inception, embeddings have become a primary ingredient in many kinds of Natural Language Processing (NLP) tasks, supplanting earlier types of representation. Although multilingual embeddings are used for a growing number of multilingual tasks, work on low-resource languages such as Sinhala tends to focus on monolingual embeddings because parallel training data is scarce. These monolingual embeddings are difficult to use in multilingual tasks: even when the embedding spaces of two languages have a similar geometric arrangement because they were trained with an identical process, the spaces themselves are not aligned. The embedding alignment task addresses this problem. Here too, high-resource language pairs are in the limelight, while low-resource languages such as Sinhala, which are in dire need of such support, seem to have fallen by the wayside. In this paper, we attempt to align Sinhala and English word embedding spaces using available alignment techniques and introduce a benchmark for Sinhala language embedding alignment. To facilitate supervised alignment, we also introduce, as an intermediate task, Sinhala-English alignment datasets, which serve as the anchor datasets for supervised word embedding alignment. Although our results are not comparable to those reported for high-resource languages such as French, German, or Chinese, we believe our work lays the groundwork for more specialized alignment between English and Sinhala embeddings.
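To make the idea of supervised alignment with anchor pairs concrete, the following is a minimal sketch using orthogonal Procrustes, one widely used alignment technique. The abstract does not state which techniques the paper evaluates, so the choice of Procrustes here is an assumption, and the function name `procrustes_align` is hypothetical. The toy vectors are synthetic stand-ins for anchor-pair embeddings, not real Sinhala or English data.

```python
import numpy as np


def procrustes_align(src, tgt):
    """Return the orthogonal matrix W that minimises ||src @ W - tgt||_F.

    src, tgt: arrays of shape (n_pairs, dim); row i of each array holds the
    embedding of the i-th anchor pair in the source and target space.
    """
    # Closed-form solution: W = U V^T, where U S V^T is the SVD of src^T tgt.
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


# Synthetic illustration (assumption: the two spaces differ only by a rotation).
rng = np.random.default_rng(0)
dim, n_pairs = 4, 50
tgt_space = rng.normal(size=(n_pairs, dim))      # stand-in "target" vectors
q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # random orthogonal matrix
src_space = tgt_space @ q.T                       # rotated "source" vectors

W = procrustes_align(src_space, tgt_space)
aligned = src_space @ W                           # source mapped into target space
```

Because the map is constrained to be orthogonal, it preserves distances and angles within the source space, which matches the premise that the two spaces already share a similar geometric arrangement. Real monolingual embeddings are only approximately related by a rotation, so the fit on actual anchor pairs would be imperfect.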