Constrained Decoding for Cross-lingual Label Projection

Zero-shot cross-lingual transfer utilizing multilingual LLMs has become a popular learning paradigm for low-resource languages with no labeled training data. However, for NLP tasks that involve fine-grained predictions on words and phrases, the performance of zero-shot cross-lingual transfer learning lags far behind supervised fine-tuning methods. Therefore, it is common to exploit translation and label projection to further improve performance by (1) translating training data that is available in a high-resource language (e.g., English), together with the gold labels, into low-resource languages, and/or (2) translating test data in low-resource languages into a high-resource language to run inference on, then projecting the predicted span-level labels back onto the original test data. However, state-of-the-art marker-based label projection methods suffer from degraded translation quality because of the extra label markers injected into the input to the translation model. In this work, we explore a new direction that leverages constrained decoding for label projection to overcome this issue. Our new method not only preserves the quality of translated texts but is also applicable to both strategies: translating training data and translating test data. This versatility is crucial, as our experiments reveal that translating test data can lead to a considerable boost in performance compared to translating only training data. We evaluate on two cross-lingual transfer tasks, namely Named Entity Recognition and Event Argument Extraction, spanning 20 languages. The results demonstrate that our approach outperforms the state-of-the-art marker-based method by a large margin and also performs better than other label projection methods that rely on external word alignment.
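
To make the marker-based baseline mentioned above concrete, the following is a minimal sketch of how label markers can be injected into the source text and then read back from the translation. It illustrates the baseline idea only, not the constrained decoding method proposed in this work. The `<m0>...</m0>` marker format, the function names `insert_markers` and `project_labels`, and the hard-coded translation string are all assumptions made for illustration; a real system would call a translation model, and the markers in its input are exactly what can degrade translation quality.

```python
import re
from typing import List, Tuple

# (start, end, label) with character offsets; end is exclusive.
Span = Tuple[int, int, str]

# Hypothetical marker format: <m0>...</m0>, <m1>...</m1>, ...
MARKER = re.compile(r"<m(\d+)>(.*?)</m\1>")


def insert_markers(text: str, spans: List[Span]) -> Tuple[str, List[str]]:
    """Wrap each labeled span in numbered markers.

    Assumes the spans do not overlap. Returns the marked text and the
    labels in marker order, so that marker i carries labels[i].
    """
    ordered = sorted(spans)
    pieces, cursor = [], 0
    for i, (start, end, _label) in enumerate(ordered):
        pieces.append(text[cursor:start])
        pieces.append(f"<m{i}>{text[start:end]}</m{i}>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), [label for _, _, label in ordered]


def project_labels(marked_translation: str, labels: List[str]) -> Tuple[str, List[Span]]:
    """Strip markers from a translated sentence and recover labeled spans.

    Offsets in the returned spans refer to the marker-free translation.
    """
    pieces, spans = [], []
    cursor = 0   # position in the marked translation
    length = 0   # length of the marker-free text built so far
    for match in MARKER.finditer(marked_translation):
        before = marked_translation[cursor:match.start()]
        pieces.append(before)
        length += len(before)
        inner = match.group(2)
        spans.append((length, length + len(inner), labels[int(match.group(1))]))
        pieces.append(inner)
        length += len(inner)
        cursor = match.end()
    pieces.append(marked_translation[cursor:])
    return "".join(pieces), spans


# Usage: the translation below is hard-coded to stand in for the output of a
# translation model that happened to preserve the markers.
marked, labels = insert_markers("Marie lives in Paris", [(0, 5, "PER"), (15, 20, "LOC")])
translated = "<m0>Marie</m0> vit à <m1>Paris</m1>"
clean, projected = project_labels(translated, labels)
print(marked)
print(clean, projected)
```

The sketch assumes the translation model copies every marker through intact. In practice markers may be dropped or misplaced, and their presence changes the input the translation model sees.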
