The advent of multilingual language models has generated a resurgence of interest in cross-lingual information retrieval (CLIR), the task of searching documents in one language with queries from another. However, the rapid pace of progress has led to a confusing panoply of methods, and reproducibility has lagged behind the state of the art. In this context, our work makes two important contributions. First, we provide a conceptual framework for organizing different approaches to cross-lingual retrieval, using multi-stage architectures for monolingual retrieval as a scaffold. Second, we implement simple yet effective reproducible baselines in the Anserini and Pyserini IR toolkits for the test collections from the TREC 2022 NeuCLIR Track in Persian, Russian, and Chinese. Our efforts build on a collaboration between the two teams that submitted the most effective runs to the TREC evaluation. These contributions provide a firm foundation for future advances.