In cross-lingual named entity recognition (NER), self-training is commonly used to bridge the linguistic gap by training on pseudo-labeled target-language data. However, because the model performs sub-optimally on target languages, the pseudo labels are often noisy and limit overall performance. In this work, we aim to improve self-training for cross-lingual NER by combining representation learning and pseudo-label refinement in one coherent framework. Our proposed method, ContProto, comprises two components: (1) contrastive self-training and (2) prototype-based pseudo-labeling. Contrastive self-training facilitates span classification by separating clusters of different classes, and enhances cross-lingual transferability by producing closely aligned representations between the source and target languages. Meanwhile, prototype-based pseudo-labeling improves the accuracy of pseudo labels during training. We evaluate ContProto on multiple transfer pairs, and experimental results show that our method yields substantial improvements over current state-of-the-art methods.
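To make the idea of prototype-based pseudo-labeling more concrete, the following is a minimal NumPy sketch of one plausible reading of it: each class keeps a prototype vector (here a moving average of span representations), and the pseudo label of a span is nudged toward the class of its most similar prototype. The abstract does not specify the update rule, so the function names, the `momentum` and `alpha` parameters, and the use of cosine similarity are all assumptions made for illustration, not ContProto's actual implementation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def update_prototypes(prototypes, reps, labels, momentum=0.9):
    """Hypothetical moving-average update of class prototypes.

    prototypes: (num_classes, dim) current unit-length prototypes
    reps:       (num_spans, dim) span representations
    labels:     (num_spans,) integer class of each span (gold or pseudo)
    """
    reps = l2_normalize(reps)
    updated = prototypes.copy()
    for c in np.unique(labels):
        class_mean = reps[labels == c].mean(axis=0)
        # Blend the old prototype with the batch mean, then re-normalize.
        updated[c] = l2_normalize(momentum * updated[c] + (1 - momentum) * class_mean)
    return updated


def refine_pseudo_labels(soft_labels, reps, prototypes, alpha=0.5):
    """Hypothetical refinement: mix current soft labels with a one-hot
    vector for the nearest prototype (by cosine similarity).

    soft_labels: (num_spans, num_classes) current pseudo-label distributions
    """
    sims = l2_normalize(reps) @ prototypes.T           # (num_spans, num_classes)
    nearest = sims.argmax(axis=1)                      # closest prototype per span
    one_hot = np.eye(prototypes.shape[0])[nearest]
    return alpha * soft_labels + (1 - alpha) * one_hot
```

In this sketch, a span whose representation sits close to a class prototype has its pseudo label pulled toward that class even when the classifier's own prediction disagrees, which is the intuition behind refining noisy pseudo labels with prototypes. Because both terms of the mixture are probability distributions and `alpha` lies in [0, 1], each refined label remains a valid distribution.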