Effective cross-lingual dense retrieval methods that rely on multilingual pre-trained language models (PLMs) need to be trained to encompass both the relevance matching task and the cross-language alignment task. However, cross-lingual training data is often scarce. In this paper, rather than using more cross-lingual data for training, we propose to use cross-lingual query generation to augment passage representations with queries in languages other than the original passage language. These augmented representations are used at inference time so that each passage representation encodes more information across the different target languages. Training the cross-lingual query generator requires no training data beyond that used for the dense retriever. Query generator training is also effective because the generator's pre-training task (T5 text-to-text training) is very similar to its fine-tuning task (generation of a query). Using the generator does not increase query latency at inference, and the approach can be combined with any cross-lingual dense retrieval method. Experiments on a benchmark cross-lingual information retrieval dataset show that our approach can improve the effectiveness of existing cross-lingual dense retrieval methods. The implementation of our methods, along with all generated query files, is publicly available at https://github.com/ielab/xQG4xDR.
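To make the augmentation idea concrete, the following is a minimal sketch of one way a passage embedding could be combined with embeddings of queries generated in several languages. It is not the authors' implementation: the interpolation scheme, the weight `alpha`, and the names `mean_vector` and `augment_passage_embedding` are assumptions introduced here for illustration, and the embeddings are assumed to come from whatever dense retriever encoder is in use. Because the combination happens when the passage index is built, the per-query cost is a single dot product, as with an unaugmented index.

```python
from typing import Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def augment_passage_embedding(
    passage_emb: Vector,
    query_embs_by_lang: Dict[str, List[Vector]],
    alpha: float = 0.1,  # hypothetical mixing weight, not taken from the paper
) -> Vector:
    """Interpolate a passage embedding with generated-query embeddings.

    Queries are averaged within each language first, then across languages,
    so a language with many generated queries does not dominate.
    """
    lang_means = [mean_vector(embs) for embs in query_embs_by_lang.values() if embs]
    if not lang_means:
        # No generated queries: keep the original representation.
        return list(passage_emb)
    query_part = mean_vector(lang_means)
    return [(1 - alpha) * p + alpha * q for p, q in zip(passage_emb, query_part)]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product relevance score between a query and a passage embedding."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    passage = [1.0, 0.0]
    generated = {
        "zh": [[0.0, 1.0]],
        "fr": [[0.0, 1.0], [0.0, 1.0]],
    }
    augmented = augment_passage_embedding(passage, generated, alpha=0.5)
    query = [0.0, 1.0]
    print("score before:", dot(query, passage))
    print("score after: ", dot(query, augmented))
```

In this toy setup the query is orthogonal to the original passage embedding, so its score rises only after the generated-query embeddings are mixed in. In practice the mixing weight would need tuning on held-out data.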