Neural ranking methods based on large transformer models have recently gained significant attention in the information retrieval community and have been adopted by major commercial solutions. Nevertheless, they are computationally expensive to create and require a great deal of labeled data for specialized corpora. In this paper, we explore a low-resource alternative, a bag-of-embedding model for document retrieval, and find that it is competitive with large transformer models fine-tuned on information retrieval tasks. Our results show that a simple combination of TF-IDF, a traditional keyword matching method, with a shallow embedding model provides a low-cost path to competing with complex neural ranking models on three datasets. Furthermore, adding TF-IDF measures improves the performance of large-scale fine-tuned models on these tasks.
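To make the idea of combining TF-IDF with a shallow embedding model concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not say how the two signals are combined, so the linear interpolation with weight `alpha` is an assumption, as are the smoothed IDF formula, the function names, and the two-dimensional toy word vectors in `TOY_EMB`, which are invented purely for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Whitespace tokenization; a real system would use a proper tokenizer.
    return text.lower().split()

def build_idf(docs_tokens):
    # Smoothed inverse document frequency (one common variant, assumed here).
    n = len(docs_tokens)
    df = Counter(t for toks in docs_tokens for t in set(toks))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf_vector(tokens, idf):
    # Sparse TF-IDF vector as a dict; terms unseen in the corpus are dropped.
    tf = Counter(tokens)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}

def sparse_cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mean_embedding(tokens, emb, dim):
    # Bag of embeddings: average the vectors of all known words.
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def dense_cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_rank(query, docs, emb, dim, alpha=0.5):
    # Assumed combination: alpha * TF-IDF cosine + (1 - alpha) * embedding cosine.
    docs_tokens = [tokenize(d) for d in docs]
    q_tokens = tokenize(query)
    idf = build_idf(docs_tokens)
    q_sparse = tfidf_vector(q_tokens, idf)
    q_dense = mean_embedding(q_tokens, emb, dim)
    scores = []
    for toks in docs_tokens:
        lexical = sparse_cosine(q_sparse, tfidf_vector(toks, idf))
        semantic = dense_cosine(q_dense, mean_embedding(toks, emb, dim))
        scores.append(alpha * lexical + (1 - alpha) * semantic)
    ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranking, scores

# Hypothetical toy data: axis 0 is roughly "travel", axis 1 roughly "cars".
TOY_EMB = {
    "flights": [1.0, 0.0], "airfare": [0.9, 0.1], "paris": [0.8, 0.2],
    "europe": [0.8, 0.2], "cheap": [0.7, 0.3], "budget": [0.7, 0.3],
    "deals": [0.7, 0.3], "car": [0.0, 1.0], "engine": [0.1, 0.9],
    "repair": [0.2, 0.8], "manual": [0.3, 0.7],
}
DOCS = [
    "cheap flights to paris",
    "car engine repair manual",
    "budget airfare deals europe",
]

ranking, scores = hybrid_rank("cheap airfare", DOCS, TOY_EMB, dim=2)
print(ranking, scores)
```

In this sketch, setting `alpha=1.0` reduces the ranker to pure keyword matching and `alpha=0.0` to the embedding model alone, so the weight could be tuned on held-out queries for a given corpus.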