Most existing word alignment methods rely on manual alignment datasets or parallel corpora, which limits their usefulness. Here, to mitigate the dependence on manual data, we broaden the source of supervision by relaxing the requirement for correct, fully aligned, and parallel sentences. Specifically, we build supervision from noisy, partially aligned, and non-parallel paragraphs. We then use the resulting large-scale weakly supervised dataset for word alignment pre-training via span prediction. Extensive experiments across various settings show that our approach, named WSPAlign, is an effective and scalable way to pre-train word aligners without manual data. When fine-tuned on standard benchmarks, WSPAlign sets a new state of the art, improving upon the best supervised baseline by 3.3 to 6.1 points in F1 and 1.5 to 6.1 points in AER. Furthermore, WSPAlign achieves competitive performance compared with the corresponding baselines in few-shot, zero-shot, and cross-lingual tests, which suggests that WSPAlign is potentially more practical for low-resource languages than existing methods.
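For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how F1 and alignment error rate (AER) are commonly computed for word alignment, using the standard definitions with "sure" and "possible" gold links. The abstract does not describe the exact evaluation script used for WSPAlign, so the function name `alignment_scores` and the toy data are illustrative assumptions rather than the authors' code.

```python
from typing import Dict, Set, Tuple

Link = Tuple[int, int]  # (source word index, target word index)


def alignment_scores(
    predicted: Set[Link], sure: Set[Link], possible: Set[Link]
) -> Dict[str, float]:
    """Precision, recall, F1 and AER under the common sure/possible convention.

    Hypothetical helper for illustration only; not taken from WSPAlign.
    """
    if not predicted or not sure:
        raise ValueError("predicted and sure link sets must be non-empty")

    # By convention every sure link also counts as a possible link.
    possible = possible | sure

    hits_possible = len(predicted & possible)
    hits_sure = len(predicted & sure)

    precision = hits_possible / len(predicted)
    recall = hits_sure / len(sure)
    f1 = 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)

    # AER is an error rate: lower is better.
    aer = 1.0 - (hits_sure + hits_possible) / (len(predicted) + len(sure))
    return {"precision": precision, "recall": recall, "f1": f1, "aer": aer}


# Toy example: three predicted links, two sure gold links, one extra possible link.
scores = alignment_scores(
    predicted={(0, 0), (1, 1), (2, 3)},
    sure={(0, 0), (1, 1)},
    possible={(2, 2)},
)
print(scores)
```

Because AER is an error rate, an "improvement" in AER corresponds to a lower value, whereas an improvement in F1 corresponds to a higher one.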