Zero-resource cross-lingual transfer approaches aim to apply supervised models trained on a source language to unlabelled target languages. In this paper we perform an in-depth study of the two main techniques employed so far for cross-lingual zero-resource sequence labelling, based either on data or model transfer. Although previous research has proposed translation and annotation projection (data-based cross-lingual transfer) as an effective technique for cross-lingual sequence labelling, we experimentally demonstrate that high-capacity multilingual language models applied in a zero-shot setting (model-based cross-lingual transfer) consistently outperform data-based cross-lingual transfer approaches. A detailed analysis of our results suggests that this might be due to important differences in language use. More specifically, machine translation often generates a textual signal which is different from what the models are exposed to when using gold standard data, which affects both the fine-tuning and evaluation processes. Our results also indicate that data-based cross-lingual transfer approaches remain a competitive option when high-capacity multilingual language models are not available.