TransAlign: Machine Translation Encoders are Strong Word Aligners, Too

In the absence of sizable training data for most world languages and NLP tasks, translation-based strategies have been established as competitive approaches for cross-lingual transfer (XLT). These include translate-test (evaluating on noisy source language data translated from the target language) and translate-train (training on noisy target language data translated from the source language). For token classification tasks, these strategies require label projection: mapping the labels from each token in the original sentence to its counterpart(s) in the translation. To this end, it is common to leverage multilingual word aligners (WAs) derived from encoder language models such as mBERT or LaBSE. Despite obvious associations between machine translation (MT) and WA, research on extracting alignments with MT models is largely limited to exploiting cross-attention in encoder-decoder architectures, yielding poor WA results. In this work, in contrast, we propose TransAlign, a novel word aligner that utilizes the encoder of a massively multilingual MT model. We show that TransAlign not only achieves strong WA performance but substantially outperforms popular WA and state-of-the-art non-WA-based label projection methods in MT-based XLT for token classification.
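To make the label projection step concrete, the following is a minimal sketch of how token labels might be copied across a translation once a word alignment is available. It is not TransAlign itself and not the authors' projection procedure: the function name `project_labels`, the plain (non-BIO) label scheme, the "first non-O label wins" conflict rule, and the hand-written alignment pairs are all assumptions made for illustration. In practice the alignment pairs would come from a word aligner.

```python
from typing import List, Tuple


def project_labels(
    src_labels: List[str],
    alignments: List[Tuple[int, int]],
    tgt_len: int,
    outside: str = "O",
) -> List[str]:
    """Copy source token labels onto aligned target tokens.

    `alignments` holds (source_index, target_index) pairs. Target tokens
    with no aligned labeled source token keep the `outside` label.
    Hypothetical helper for illustration only.
    """
    tgt_labels = [outside] * tgt_len
    for src_idx, tgt_idx in sorted(alignments):
        label = src_labels[src_idx]
        if label == outside:
            continue  # unlabeled source tokens project nothing
        # Assumed conflict rule: the first non-O label to reach a target wins.
        if tgt_labels[tgt_idx] == outside:
            tgt_labels[tgt_idx] = label
    return tgt_labels


# Toy English -> German example with word reordering.
src_tokens = ["Angela", "Merkel", "visited", "Paris"]
src_labels = ["PER", "PER", "O", "LOC"]
tgt_tokens = ["Paris", "wurde", "von", "Angela", "Merkel", "besucht"]

# Hand-written alignment, standing in for the output of a word aligner.
alignments = [(0, 3), (1, 4), (2, 5), (3, 0)]

projected = project_labels(src_labels, alignments, len(tgt_tokens))
for token, label in zip(tgt_tokens, projected):
    print(f"{token}\t{label}")
```

Because the projected labels depend entirely on the alignment pairs, a missing or wrong pair directly drops or misplaces a label. This is why alignment quality matters for translate-train and translate-test on token classification.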
