The Arabic language lacks semantic datasets and sense inventories. The most common semantically labeled dataset for Arabic is ArabGlossBERT, a relatively small dataset of 167K context-gloss pairs (about 60K positive and 107K negative pairs) collected from Arabic dictionaries. This paper presents an enrichment of the ArabGlossBERT dataset, augmenting it through machine back-translation (Arabic-English-Arabic). Augmentation increased the dataset size to 352K pairs (149K positive and 203K negative pairs). We measure the impact of augmentation by fine-tuning BERT on the target sense verification (TSV) task under different data configurations. Overall, accuracy ranges from 78% to 84% across data configurations. Although our approach performed on par with the baseline, we observed improvements for some part-of-speech (POS) tags in some experiments. Furthermore, our fine-tuned models are trained on a larger dataset covering a larger vocabulary and more contexts. We provide an in-depth analysis of the accuracy for each POS tag.
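As an illustration of the back-translation augmentation described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes that only the context of each context-gloss pair is paraphrased and that the label is carried over unchanged; the abstract does not specify either detail. The names `Pair`, `back_translate`, and `augment` are hypothetical, and the two translation callables stand in for whatever Arabic-to-English and English-to-Arabic systems are actually used.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical type: any function mapping a string to its translation.
Translator = Callable[[str], str]


@dataclass(frozen=True)
class Pair:
    """One context-gloss pair; label 1 = gloss matches the sense, 0 = it does not."""
    context: str
    gloss: str
    label: int


def back_translate(text: str, to_pivot: Translator, from_pivot: Translator) -> str:
    """Translate into the pivot language and back (e.g. Arabic-English-Arabic)."""
    return from_pivot(to_pivot(text))


def augment(pairs: Iterable[Pair], to_pivot: Translator, from_pivot: Translator) -> List[Pair]:
    """Return the original pairs plus back-translated variants of their contexts.

    Assumption: the gloss and label are kept as-is, and a variant is added
    only when back-translation actually produced a new (context, gloss) pair.
    """
    originals = list(pairs)
    augmented = list(originals)
    seen = {(p.context, p.gloss) for p in originals}
    for p in originals:
        new_context = back_translate(p.context, to_pivot, from_pivot)
        key = (new_context, p.gloss)
        if key not in seen:
            seen.add(key)
            augmented.append(Pair(new_context, p.gloss, p.label))
    return augmented
```

Passing the translators in as plain callables keeps the sketch independent of any particular machine translation service. Skipping variants that are identical to an existing pair means the augmented set can be smaller than twice the original, which is consistent with 167K pairs growing to 352K rather than to a larger multiple, although the abstract does not say how that figure was reached.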