In the absence of readily available labeled data for a given sequence labeling task and language, annotation projection has been proposed as one possible strategy for automatically generating annotated data. Annotation projection is often formulated as the task of transferring, on parallel corpora, the labels of a given span in the source language to the corresponding span in the target language. In this paper we present T-Projection, a novel approach to annotation projection that leverages large pretrained text-to-text language models and state-of-the-art machine translation technology. T-Projection decomposes the label projection task into two subtasks: (i) a candidate generation step, in which a multilingual T5 model generates a set of projection candidates, and (ii) a candidate selection step, in which the generated candidates are ranked by their translation probabilities. We conducted experiments on intrinsic and extrinsic tasks in 5 Indo-European and 8 low-resource African languages. We demonstrate that T-Projection outperforms previous annotation projection methods by a wide margin. We believe that T-Projection can help to automatically alleviate the lack of high-quality training data for sequence labeling tasks. Code and data are publicly available.
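The two-step decomposition described in the abstract can be illustrated with a minimal sketch. This is not the released T-Projection code: candidate generation is replaced here by plain span enumeration (the paper uses a multilingual T5 model), and the translation probabilities come from a hypothetical toy lexicon rather than a machine translation system. All function names, the lexicon, and the example sentence are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Tuple


def generate_candidates(target_tokens: List[str], max_len: int = 4) -> List[str]:
    """Step (i), simplified: enumerate every target span of up to max_len tokens.

    Assumption: this stands in for the multilingual T5 generation step.
    """
    n = len(target_tokens)
    return [
        " ".join(target_tokens[i:j])
        for i in range(n)
        for j in range(i + 1, min(i + max_len, n) + 1)
    ]


# Hypothetical word-level translation probabilities (English -> Spanish).
TOY_LEXICON: Dict[Tuple[str, str], float] = {
    ("New", "Nueva"): 0.90,
    ("York", "York"): 0.95,
}
UNKNOWN_PROB = 0.01  # assumed floor for word pairs missing from the lexicon


def toy_translation_score(source_span: str, candidate: str) -> float:
    """Toy stand-in for a translation probability, scored in both directions."""
    src, tgt = source_span.split(), candidate.split()
    score = 1.0
    # Every source word should be explained by some candidate word ...
    for s in src:
        score *= max(TOY_LEXICON.get((s, t), UNKNOWN_PROB) for t in tgt)
    # ... and every candidate word by some source word (penalizes extra words).
    for t in tgt:
        score *= max(TOY_LEXICON.get((s, t), UNKNOWN_PROB) for s in src)
    return score


def select_candidate(
    source_span: str,
    candidates: List[str],
    score: Callable[[str, str], float],
) -> str:
    """Step (ii): rank the candidates by translation score and keep the best."""
    return max(candidates, key=lambda c: score(source_span, c))


def project_label(source_span: str, label: str, target_tokens: List[str]) -> Tuple[str, str]:
    """Project one labeled source span onto the target sentence."""
    candidates = generate_candidates(target_tokens)
    best = select_candidate(source_span, candidates, toy_translation_score)
    return best, label


if __name__ == "__main__":
    target = "Ella vive en Nueva York".split()
    print(project_label("New York", "LOC", target))
```

In this sketch the scoring function is passed to `select_candidate` as a parameter, so the toy lexicon could be swapped for probabilities from an actual translation model without changing the selection logic.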