Document alignment, which pairs documents across source and target languages within the same web domain, is a necessary step in hierarchical mining (Bañón et al., 2020; Morishita et al., 2022). Several high-precision methods based on sentence embeddings have been developed, such as TK-PERT (Thompson and Koehn, 2020) and Optimal Transport (OT) (Clark et al., 2019; El-Kishky and Guzmán, 2020). However, given the massive scale of web mining data, both accuracy and speed must be considered. In this paper, we propose a cross-lingual Bidirectional MaxSim score (BiMax) for computing doc-to-doc similarity, which is more efficient than the OT method. On the WMT16 bilingual document alignment task, BiMax attains accuracy comparable to OT while running approximately 100 times faster. We also conduct a comprehensive analysis of the performance of current state-of-the-art multilingual sentence embedding models. All the alignment methods in this paper are publicly available as a tool called EmbDA (https://github.com/EternalEdenn/EmbDA).
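The abstract names the BiMax score without giving its formula. As an illustration only, the following is a minimal sketch of one plausible reading: each sentence embedding in one document is matched to its most similar sentence embedding in the other document, the per-sentence maxima are averaged, and the two directions are then averaged. The function names (`maxsim`, `bimax`), the use of cosine similarity, and the equal weighting of the two directions are assumptions made for this sketch; they are not taken from EmbDA, and the sentence embeddings are assumed to be precomputed by some multilingual encoder.

```python
import numpy as np


def _unit_rows(x):
    """Return x as a 2-D float array with each row scaled to unit L2 norm."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return x / np.clip(norms, 1e-12, None)


def maxsim(src, tgt):
    """Directional score (assumed form): mean over source sentences of the
    highest cosine similarity to any target sentence."""
    sim = _unit_rows(src) @ _unit_rows(tgt).T  # shape: (n_src, n_tgt)
    return float(sim.max(axis=1).mean())


def bimax(src, tgt):
    """Bidirectional score (assumed form): average of the two directional
    scores, computed from a single similarity matrix."""
    sim = _unit_rows(src) @ _unit_rows(tgt).T
    src_to_tgt = sim.max(axis=1).mean()  # best target match per source sentence
    tgt_to_src = sim.max(axis=0).mean()  # best source match per target sentence
    return float(0.5 * (src_to_tgt + tgt_to_src))


if __name__ == "__main__":
    # Toy 2-D "embeddings": a two-sentence document and a one-sentence document.
    doc_a = [[1.0, 0.0], [0.0, 1.0]]
    doc_b = [[1.0, 0.0]]
    print(bimax(doc_a, doc_b))
```

Under these assumptions, scoring a document pair needs only one similarity matrix and two max reductions, with no iterative optimization over a transport plan.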