Mining parallel document pairs poses a significant challenge because existing sentence embedding models often have limited context windows, preventing them from effectively capturing document-level information. Another overlooked issue is the lack of robust evaluation benchmarks comprising high-quality parallel document pairs for assessing document-level mining approaches, particularly for Indic languages. In this study, we introduce Pralekha, a large-scale benchmark for document-level alignment evaluation. Pralekha includes over 2 million documents, with a 1:2 ratio of unaligned to aligned pairs, covering 11 Indic languages and English. Using Pralekha, we evaluate various document-level mining approaches across three dimensions: embedding models, granularity levels, and alignment algorithms. To address the challenge of aligning documents from sentence- and chunk-level alignments, we propose a novel scoring method, the Document Alignment Coefficient (DAC). DAC achieves substantial improvements over baseline pooling approaches, particularly in noisy scenarios, with average gains of 20-30% in precision and 15-20% in F1 score. These results highlight DAC's effectiveness in parallel document mining for Indic languages.