Word segmentation (WSG) and part-of-speech (POS) tagging are important for the study of ancient Chinese, but annotated WSG and POS data for ancient Chinese remain scarce. In this paper, we propose a novel method for augmenting ancient Chinese WSG and POS tagging data using distant supervision over a parallel corpus. Distant supervision, however, inevitably leaves some ancient Chinese words mislabeled or unlabeled. To address this problem, we exploit the memorization effect of deep neural networks (their tendency to fit clean patterns before noisy labels) together with a small amount of annotated data to obtain a model that has learned much of the useful knowledge while absorbing little of the noise. We then use this model to relabel the ancient Chinese sentences in the parallel corpus. Experiments show that the model trained on the relabeled data outperforms models trained on the distantly supervised data and on the annotated data. Our code is available at https://github.com/farlit/ACDS.
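To make the idea of distant supervision over a parallel corpus more concrete, the following is a minimal sketch and not the authors' implementation (that is in the repository linked above). It assumes the modern Chinese side of a sentence pair has already been segmented and POS-tagged by some existing tool, and it projects tags onto ancient characters by simple character overlap. The function name `project_labels`, the overlap heuristic, and the tag set (`v`, `c`, `d`, `r`) are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple


def project_labels(
    ancient: str,
    modern_tagged: List[Tuple[str, str]],
) -> List[Tuple[str, Optional[str]]]:
    """Project POS tags from a tagged modern translation onto ancient characters.

    Hypothetical heuristic: each ancient character takes the tag of the first
    not-yet-used modern word that contains it. Characters with no match stay
    unlabeled (None).
    """
    used = [False] * len(modern_tagged)
    labels: List[Tuple[str, Optional[str]]] = []
    for ch in ancient:
        tag: Optional[str] = None
        for i, (word, pos) in enumerate(modern_tagged):
            if not used[i] and ch in word:
                tag = pos
                used[i] = True  # each modern word supplies at most one label
                break
        labels.append((ch, tag))
    return labels


# Toy sentence pair; the modern-side tags are assumed for illustration only.
modern = [("学习", "v"), ("并且", "c"), ("时常", "d"), ("温习", "v"), ("它", "r")]
example = project_labels("学而时习之", modern)
print(example)
```

In this toy pair, the function words 而 and 之 share no character with the modern translation and therefore stay unlabeled, which is the kind of gap (alongside outright mislabels) that the relabeling step described above is meant to repair.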