Sinhala-English Parallel Word Dictionary Dataset

Parallel datasets are vital for performing and evaluating any kind of multilingual task. However, when one language of the pair under consideration is low-resource, the existing top-down parallel data, such as corpora, are lacking in both quantity and quality owing to the dearth of human annotation. For low-resource languages it is therefore more feasible to move in the bottom-up direction, first developing finer-grained pairs such as dictionary datasets. These can then be used for mid-level tasks such as supervised multilingual word embedding alignment, which in turn can guide higher-level tasks such as aligning the sentence or paragraph corpora used for Machine Translation (MT). Although building such datasets is more approachable than generating and aligning a massive corpus, even these finer-grained datasets are lacking for some low-resource languages, for the same reason: apathy from larger research entities. We have observed that there is no free and open dictionary dataset for the low-resource language Sinhala. In this work, we therefore introduce three parallel English-Sinhala word dictionaries (En-Si-dict-large, En-Si-dict-filtered, En-Si-dict-FastText) that support multilingual Natural Language Processing (NLP) tasks involving English and Sinhala. In this paper, we explain the dataset creation pipeline and report the experimental results of the tests we carried out to verify the quality of the datasets. The datasets and the related scripts are available at https://github.com/kasunw22/sinhala-para-dict.
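
As an illustration of the mid-level task mentioned above, the following is a minimal sketch of supervised word embedding alignment driven by dictionary pairs. It uses the standard orthogonal Procrustes solution, which is a common choice for this task and not necessarily the method used in this work. The function name `procrustes_align` is hypothetical, and the vectors are synthetic toy data standing in for the embeddings of the English and Sinhala words of each dictionary pair.

```python
import numpy as np


def procrustes_align(src_vecs, tgt_vecs):
    """Return the orthogonal matrix W minimising ||src_vecs @ W - tgt_vecs||_F.

    Row i of src_vecs and row i of tgt_vecs are assumed to be the embeddings
    of the two words in the i-th dictionary pair.
    """
    # Closed-form solution: W = U V^T, where U S V^T is the SVD of X^T Y.
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt


# Synthetic toy data (assumption): 10 "dictionary pairs" in a 4-dimensional space.
rng = np.random.default_rng(0)
dim, n_pairs = 4, 10
src = rng.normal(size=(n_pairs, dim))

# Build the target space as an exact rotation of the source space,
# so the mapping that should be recovered is known in advance.
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
tgt = src @ true_rotation

W = procrustes_align(src, tgt)
aligned = src @ W  # source vectors mapped into the target space
```

In practice the rows of `src` and `tgt` would be looked up from pretrained monolingual embeddings using the dictionary entries, and the learned `W` would then be applied to the full source vocabulary.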

