Interpreting ancient Chinese is key to comprehending the vast body of Chinese literature, tradition, and civilization. In this paper, we propose Erya for ancient Chinese translation. From a dataset perspective, we collect, clean, and classify ancient Chinese materials from various sources, forming the most extensive ancient Chinese resource to date. From a model perspective, we devise the Erya training method oriented towards ancient Chinese, designing two jointly-working tasks: disyllabic aligned substitution (DAS) and dual masked language model (DMLM). From an evaluation perspective, we build a benchmark to judge ancient Chinese translation quality in different scenarios and evaluate the ancient Chinese translation capacities of various existing models. Our model exhibits remarkable zero-shot performance across five domains, outperforming GPT-3.5 models by more than 12.0 BLEU and achieving better human evaluation results than ERNIE Bot. Subsequent fine-tuning further shows the superior transfer capability of the Erya model, with a gain of 6.2 BLEU. We release all the above-mentioned resources at https://github.com/RUCAIBox/Erya.