We present SentAlign, an accurate sentence alignment tool designed to handle very large parallel document pairs. Given user-defined parameters, the alignment algorithm evaluates all possible alignment paths in documents of thousands of sentences, and it uses a divide-and-conquer approach to align documents containing tens of thousands of sentences. The scoring function is based on LaBSE bilingual sentence representations. SentAlign outperforms five other sentence alignment tools on two evaluation sets, German-French and English-Icelandic, and on a downstream machine translation task.
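The idea of scoring every monotonic alignment path with embedding similarity can be illustrated with a small dynamic program. The following is a minimal sketch under stated assumptions, not SentAlign's actual algorithm: it allows only 1-1 matches and single-sentence skips, it scores matches by plain cosine similarity, and it uses toy vectors in place of LaBSE embeddings. The names `cosine`, `align`, and the `skip_score` parameter are hypothetical and chosen for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def align(src_vecs, tgt_vecs, skip_score=0.0):
    """Return the 1-1 sentence pairs on the best-scoring monotonic path.

    A path may match source sentence i with target sentence j (scored by
    cosine similarity) or skip one sentence on either side (scored by
    skip_score, a hypothetical parameter of this sketch).
    """
    n, m = len(src_vecs), len(tgt_vecs)
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]

    # best[i][j]: best score for aligning the first i source sentences
    # with the first j target sentences; back[i][j]: the move that got there.
    best = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            candidates = []
            if i > 0 and j > 0:
                candidates.append((best[i - 1][j - 1] + sim[i - 1][j - 1], "match"))
            if i > 0:
                candidates.append((best[i - 1][j] + skip_score, "skip_src"))
            if j > 0:
                candidates.append((best[i][j - 1] + skip_score, "skip_tgt"))
            if candidates:
                best[i][j], back[i][j] = max(candidates, key=lambda c: c[0])

    # Walk the back-pointers from the end of both documents to the start.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


# Toy "embeddings": the target document has one extra sentence (index 1)
# with no counterpart in the source document.
src = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
tgt = [[1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0]]
pairs = align(src, tgt)
print(pairs)  # [(0, 0), (1, 2), (2, 3)]
```

The table in this sketch has one cell per (source, target) sentence pair, so its size grows with the product of the two document lengths. That growth is one plausible reason a divide-and-conquer step becomes attractive for documents with tens of thousands of sentences, although the abstract does not describe how SentAlign splits the problem.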