Retrieval with extremely long queries and documents, commonly known as Query-by-Document (QBD) retrieval, is a well-known and challenging task in information retrieval. In previous work, Transformer models designed specifically to handle long input sequences have not shown high effectiveness on QBD tasks. We propose a Re-Ranker based on the novel Proportional Relevance Score (RPRS) to compute the relevance score between a query and the top-k candidate documents. Our extensive evaluation shows that RPRS obtains significantly better results than state-of-the-art models on five different datasets. RPRS is also highly efficient: all documents can be pre-processed, embedded, and indexed before query time, which gives our re-ranker a complexity of O(N), where N is the total number of sentences in the query and candidate documents. In addition, our method solves the problem of low-resource training in QBD retrieval tasks: it does not need large amounts of training data, and it has only three parameters with a limited range, which can be optimized with a grid search even when only a small amount of labeled data is available. Our detailed analysis shows that RPRS benefits from covering the full length of candidate documents and queries.
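To illustrate the low-resource tuning claim, the following is a minimal sketch of an exhaustive grid search over three bounded parameters using a handful of labeled examples. It is not the RPRS formula, which this abstract does not define: `toy_score`, its parameters `min_len`, `saturation`, and `length_penalty`, the precision-at-1 metric, and the example data are all hypothetical stand-ins chosen only to make the sketch runnable.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical labeled example: (query, candidate documents, index of the relevant one).
Example = Tuple[str, List[str], int]


def toy_score(query: str, doc: str, min_len: int, saturation: float,
              length_penalty: float) -> float:
    """Placeholder relevance score with three parameters (NOT the RPRS definition)."""
    q_words = {w for w in query.lower().split() if len(w) >= min_len}
    d_words = [w for w in doc.lower().split() if len(w) >= min_len]
    overlap = sum(1 for w in d_words if w in q_words)
    saturated = overlap / (overlap + saturation)  # saturation must be > 0
    return saturated / (1.0 + length_penalty * len(d_words))


def precision_at_1(score_fn: Callable[..., float], params: Dict[str, float],
                   examples: Sequence[Example]) -> float:
    """Fraction of queries whose top-scored candidate is the labeled relevant one."""
    hits = 0
    for query, candidates, relevant_idx in examples:
        scores = [score_fn(query, doc, **params) for doc in candidates]
        best = max(range(len(candidates)), key=scores.__getitem__)
        hits += int(best == relevant_idx)
    return hits / len(examples)


def grid_search(score_fn: Callable[..., float],
                grid: Dict[str, Sequence[float]],
                examples: Sequence[Example]) -> Tuple[Dict[str, float], float]:
    """Try every parameter combination and keep the first one with the best metric."""
    names = list(grid)
    best_params: Dict[str, float] = {}
    best_metric = float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        metric = precision_at_1(score_fn, params, examples)
        if metric > best_metric:
            best_params, best_metric = params, metric
    return best_params, best_metric


# Tiny illustrative data set; a real setup would use held-out labeled queries.
examples: List[Example] = [
    ("long legal case about patent infringement",
     ["patent infringement case ruling", "weather report for tuesday"], 0),
    ("prior art search for battery patents",
     ["cooking recipes and tips", "battery patents prior art"], 1),
]
grid = {"min_len": [1, 3], "saturation": [0.5, 1.0], "length_penalty": [0.0, 0.1]}

best_params, best_metric = grid_search(toy_score, grid, examples)
print(best_params, best_metric)
```

Because the search space is the product of three short value lists, the number of evaluations stays small (eight combinations here), which is why a few labeled queries can be enough to pick a setting. Swapping in a real scoring function only requires that it accept the query, the document, and the three parameters as keyword arguments.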