In this paper, we propose a new word-embedding corpus for Sindhi, consisting of more than 61 million words crawled from multiple web resources. We design a preprocessing pipeline to filter unwanted text from the crawled data. The cleaned vocabulary is then fed to the state-of-the-art continuous bag-of-words, skip-gram, and GloVe word embedding algorithms. We evaluate the pretrained embeddings with popular intrinsic and extrinsic evaluation approaches. The results reveal that continuous bag-of-words and skip-gram outperform GloVe and the existing Sindhi fastText word embeddings on both intrinsic and extrinsic evaluations.
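As a concrete illustration of the intrinsic evaluation mentioned above: word-similarity benchmarks typically score embeddings by the cosine similarity between word vectors and correlate those scores with human judgments. The sketch below shows the cosine measure itself; the vectors are small illustrative toy values, not actual embeddings from the paper (real CBOW, skip-gram, or GloVe vectors are usually 100-300 dimensional).

```python
import math

def cosine_similarity(u, v):
    # Standard intrinsic-evaluation measure: the cosine of the angle
    # between two word vectors, in [-1, 1], with 1 meaning identical
    # direction (maximally similar words under the embedding).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors for two related words,
# used purely to demonstrate the computation.
vec_a = [0.2, 0.1, 0.7]
vec_b = [0.3, 0.1, 0.6]
print(round(cosine_similarity(vec_a, vec_b), 3))  # close to 1.0 for similar vectors
```

In practice, the similarity score for each benchmark word pair is compared against human ratings (e.g. via Spearman correlation) to judge embedding quality.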