Document retrieval in many languages has largely relied on multilingual models and on leveraging the vast wealth of English training data. In Japanese, the best-performing deep-learning-based retrieval approaches rely on multilingual dense embeddings. In this work, we introduce (1) a hard-negative augmented version of the Japanese MMARCO dataset and (2) JaColBERT, a document retrieval model built on the ColBERT model architecture, specifically for Japanese. JaColBERT vastly outperforms all previous monolingual retrieval approaches and competes with the best multilingual methods, despite unfavourable evaluation settings (out-of-domain for JaColBERT vs. in-domain for the multilingual models). JaColBERT reaches an average Recall@10 of 0.813, noticeably ahead of the previous best-performing monolingual model (0.716) and only slightly behind multilingual-e5-base (0.820), though more noticeably behind multilingual-e5-large (0.856). These results are achieved using only a limited, entirely Japanese training set, more than two orders of magnitude smaller than those used to train multilingual embedding models. We believe these results show great promise to support retrieval-enhanced application pipelines in a wide variety of domains.
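The abstract reports an average Recall@10 but does not spell out how it is computed. As a minimal sketch, assuming the common definition (the fraction of a query's relevant documents that appear among the top k retrieved results, averaged over queries), the metric could look like the following. The function names `recall_at_k` and `mean_recall_at_k` are hypothetical and are not taken from the paper, whose exact evaluation protocol may differ.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Fraction of the relevant documents found in the top-k ranked results.

    Assumed definition; the paper's exact protocol is not given in the abstract.
    """
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(ranked_ids[:k])  # only the first k retrieved documents count
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(runs: Iterable[Tuple[Sequence[str], Set[str]]], k: int = 10) -> float:
    """Average Recall@k over (ranked_ids, relevant_ids) pairs, one pair per query."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in runs]
    if not scores:
        raise ValueError("runs must contain at least one query")
    return sum(scores) / len(scores)
```

For instance, a query whose two relevant documents are `d1` and `d2`, with `d1` ranked inside the cutoff and `d2` outside it, scores 0.5 under this definition.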