Most existing word alignment methods rely on manual alignment datasets or parallel corpora, which limits their usefulness. To mitigate this dependence on manual data, we broaden the source of supervision by relaxing the requirement for correct, fully aligned, and parallel sentences. Specifically, we construct noisy, partially aligned, and non-parallel paragraphs, and then use this large-scale weakly supervised dataset for word alignment pre-training via span prediction. Extensive experiments across various settings demonstrate that our approach, named WSPAlign, is an effective and scalable way to pre-train word aligners without manual data. When fine-tuned on standard benchmarks, WSPAlign sets a new state of the art, improving upon the best supervised baseline by 3.3 to 6.1 points in F1 and 1.5 to 6.1 points in AER. Furthermore, WSPAlign achieves competitive performance compared with the corresponding baselines in few-shot, zero-shot, and cross-lingual tests, which suggests that WSPAlign is potentially more practical for low-resource languages than existing methods.
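For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how F1 and AER are commonly computed for word alignment, assuming the standard definitions based on sure and possible gold links. The abstract does not state the exact evaluation procedure, so the function name `alignment_scores` and the toy link sets are illustrative assumptions, not the authors' evaluation code.

```python
def alignment_scores(predicted, sure, possible=None):
    """Score predicted alignment links against gold links.

    Each link is a (source_index, target_index) pair. This follows the
    commonly used definitions (an assumption, not taken from the abstract):
      precision = |A & P| / |A|
      recall    = |A & S| / |S|
      AER       = 1 - (|A & S| + |A & P|) / (|A| + |S|)
    where A = predicted links, S = sure links, P = possible links (P includes S).
    """
    predicted = set(predicted)
    sure = set(sure)
    # Sure links are always counted as possible links too.
    possible = set(possible or ()) | sure

    if not predicted or not sure:
        raise ValueError("predicted and sure link sets must be non-empty")

    hits_sure = len(predicted & sure)
    hits_possible = len(predicted & possible)

    precision = hits_possible / len(predicted)
    recall = hits_sure / len(sure)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    aer = 1.0 - (hits_sure + hits_possible) / (len(predicted) + len(sure))

    return {"precision": precision, "recall": recall, "f1": f1, "aer": aer}


# Toy example with made-up links: two correct predictions, one wrong.
scores = alignment_scores(
    predicted={(0, 0), (1, 1), (2, 3)},
    sure={(0, 0), (1, 1)},
    possible={(2, 2)},
)
print(scores)
```

Note that F1 is higher-is-better while AER is an error rate, so an improvement in AER corresponds to a lower value.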