Machine translation (MT) currently attracts considerable research attention. However, Persian machine translation remains largely unexplored, even though a vast amount of research has been conducted on high-resource languages such as English. Moreover, while a substantial amount of work has been done on statistical machine translation for some Persian datasets, there is currently no standard baseline for transformer-based text2text models on each corpus. In this study, we collected and analysed the most popular and valuable parallel corpora used for Persian-English translation. Furthermore, we fine-tuned and evaluated two state-of-the-art attention-based seq2seq models on each dataset separately (48 results). We hope this paper will help researchers compare their machine translation results, in both the Persian-to-English and English-to-Persian directions, against a standard baseline.