In this work, we develop and evaluate models that measure the similarity of scientific documents across languages. Such models can retrieve related work written in different languages, helping multilingual researchers find and explore papers more efficiently. We introduce the first multilingual scientific document dataset, Open-access Multilingual Scientific Documents (OpenMSD), which contains 74M papers in 103 languages and 778M citation pairs. Using OpenMSD, we pretrain science-specialized language models and explore strategies for deriving "related" paper pairs to fine-tune them, including a mixture of citation, co-citation, and bibliographic-coupling pairs. To further improve performance on non-English papers, we explore using generative language models to enrich those papers with English summaries, which lets us leverage the models' English capabilities to create better representations for non-English papers. Our best model significantly outperforms strong baselines by 7-16% in mean average precision.
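The three kinds of "related" pairs named in the abstract can be illustrated with a small sketch. It uses the standard bibliometric definitions (two papers are co-cited when a third paper cites both; two papers are bibliographically coupled when both cite the same third paper) and is not the pipeline used to build OpenMSD, which the abstract does not describe. The function name `related_pairs` and the paper IDs are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def related_pairs(citations):
    """Derive three kinds of unordered paper pairs from citation edges.

    `citations` is an iterable of (citing_id, cited_id) tuples.
    Returns (direct, co_citation, coupling), each a set of sorted ID pairs.
    Hypothetical helper for illustration only.
    """
    citations = list(citations)
    refs = defaultdict(set)      # paper -> papers it cites
    cited_by = defaultdict(set)  # paper -> papers that cite it
    for citing, cited in citations:
        refs[citing].add(cited)
        cited_by[cited].add(citing)

    # Direct citation pairs: one paper cites the other.
    direct = {tuple(sorted(edge)) for edge in citations}

    # Co-citation pairs: both papers appear in one reference list.
    co_citation = set()
    for cited_papers in refs.values():
        co_citation.update(combinations(sorted(cited_papers), 2))

    # Bibliographic-coupling pairs: both papers cite the same paper.
    coupling = set()
    for citing_papers in cited_by.values():
        coupling.update(combinations(sorted(citing_papers), 2))

    return direct, co_citation, coupling


if __name__ == "__main__":
    edges = [("A", "C"), ("B", "C"), ("D", "A"), ("D", "B")]
    direct, co_citation, coupling = related_pairs(edges)
    print(direct, co_citation, coupling)
```

In this toy graph, A and B are co-cited (D cites both) and also bibliographically coupled (both cite C). Enumerating all pairs per reference list grows quadratically with list length, so at the scale of hundreds of millions of citation pairs some form of sampling would presumably be needed.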