Historical and linguistic connections within the Sinosphere have led researchers to use Classical Chinese resources for cross-lingual transfer when processing historical documents from Korea and Japan. In this paper, we question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun, the ancient written languages of Korea and Japan, respectively. Our experiments across machine translation, named entity recognition, and punctuation restoration show that Classical Chinese datasets have minimal impact on language model performance for ancient Korean documents written in Hanja: performance differences stay within $\pm 0.0068$ F1 for the sequence labeling tasks and reach at most $+0.84$ BLEU for translation. These limitations persist across model sizes, architectures, and domain-specific datasets. Our analysis reveals that, for Hanja, the benefits of Classical Chinese resources diminish rapidly as local language data increases, and that substantial improvements appear only in extremely low-resource scenarios for both Korean and Japanese historical documents. These mixed results emphasize the need for careful empirical validation rather than assuming benefits from indiscriminate cross-lingual transfer.