Recent studies in natural language processing (NLP) have focused on modern languages and achieved state-of-the-art results in many tasks, while ancient texts and related tasks have received little attention. Classical Chinese first came to Japan approximately 2,000 years ago. It was gradually adapted into a Japanese form of reading and translation called Kanbun-Kundoku (Kanbun), which has significantly influenced Japanese literature. However, compared with the rich resources for ancient texts in mainland China, Kanbun resources remain scarce in Japan. To address this problem, we construct the first Classical-Chinese-to-Kanbun dataset in the world. Furthermore, we introduce two tasks, character reordering and machine translation, both of which play a significant role in Kanbun comprehension. We also evaluate current language models on these tasks and discuss the best evaluation method by comparing the results with human scores. We release our code and dataset on GitHub.