Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback: A Reproducibility Study

Pseudo-Relevance Feedback (PRF) utilises the relevance signals from the top-k passages of a first round of retrieval to perform a second round of retrieval, with the aim of improving search effectiveness. A recent research direction has been the study and development of PRF methods for rankers based on deep language models, and in particular for dense retrievers. Compared to more complex neural rankers, dense retrievers trade some effectiveness for lower query latency, which makes the retrieval pipeline more efficient. PRF methods for dense retrievers have been introduced in an attempt to further improve their effectiveness. In this paper, we reproduce and study a recent method for PRF with dense retrievers, called ANCE-PRF. This method concatenates the query text with the text of the top-k feedback passages to form a new query input, which is then encoded into a dense representation using a newly trained query encoder based on the original dense retriever used for the first round of retrieval. While the method can potentially be applied to any existing dense retriever, prior work has studied it only in the context of the ANCE dense retriever. We study the reproducibility of ANCE-PRF in terms of both its training (encoding of the PRF signal) and inference (ranking) steps. We further extend the empirical analysis provided in the original work to investigate the effect of the hyper-parameters that govern the training process and the robustness of the method across these different settings. Finally, we contribute a study of the generalisability of the ANCE-PRF method when dense retrievers other than ANCE are used for the first round of retrieval and for encoding the PRF signal.
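To make the two-round pipeline described above concrete, the following is a minimal sketch in Python. It is not the ANCE-PRF implementation. The encoder `toy_encode` is a hypothetical hashed bag-of-words stand-in for a trained dense encoder, the `[SEP]` separator string and the helper names (`build_prf_input`, `retrieve`, `prf_retrieve`) are illustrative assumptions, and reusing the first-round passage vectors in the second round is also an assumption of this sketch. Only the overall flow (retrieve, concatenate the query with the top-k passages, re-encode, retrieve again) is taken from the description above.

```python
import hashlib
import math


def toy_encode(text, dim=64):
    """Hypothetical stand-in for a dense encoder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero on empty text
    return [v / norm for v in vec]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec, passage_vecs, k):
    """Return the indices of the k passages with the highest inner product."""
    ranked = sorted(
        range(len(passage_vecs)),
        key=lambda i: dot(query_vec, passage_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def build_prf_input(query, feedback_passages, sep=" [SEP] "):
    """Concatenate the query text with the feedback passages (separator is an assumption)."""
    return sep.join([query] + list(feedback_passages))


def prf_retrieve(query, passages, first_encoder, prf_encoder, k_feedback=3, k_final=10):
    """Two-round retrieval: first round, build the PRF input, re-encode, second round."""
    # Assumption of this sketch: passage vectors are computed once and shared by both rounds.
    passage_vecs = [first_encoder(p) for p in passages]
    first_round = retrieve(first_encoder(query), passage_vecs, k_feedback)
    prf_input = build_prf_input(query, [passages[i] for i in first_round])
    second_round = retrieve(prf_encoder(prf_input), passage_vecs, k_final)
    return first_round, prf_input, second_round


if __name__ == "__main__":
    corpus = [
        "dense retrievers encode passages into vectors",
        "pseudo relevance feedback uses the top passages",
        "cats sleep for most of the day",
    ]
    first, prf_text, second = prf_retrieve(
        "relevance feedback", corpus, toy_encode, toy_encode, k_feedback=2, k_final=3
    )
    print("first round:", first)
    print("PRF input:", prf_text)
    print("second round:", second)
```

In this sketch the same toy function serves as both `first_encoder` and `prf_encoder`. In the method described above, the second would be a separately trained query encoder, which is why the two are passed as distinct arguments.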
