In this paper, we consider the extent to which the transformer-based Dense Passage Retrieval (DPR) algorithm, developed by Karpukhin et al. (2020), can be optimized without further pre-training. Our method rests on two ideas: we apply the DPR context encoder at several granularities (e.g., one-sentence versus five-sentence segments), and we take a confidence-calibrated ensemble prediction over all of these segmentations. This somewhat exhaustive approach achieves state-of-the-art results on benchmark datasets such as Google NQ and SQuAD. We also apply our method to domain-specific datasets, and the results suggest that different granularities are optimal for different domains.
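As an illustration only, the two ideas above (segmenting a document at several granularities, then combining calibrated scores across segmentations) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation evaluated in this paper: `overlap_score` is a hypothetical stand-in for the DPR question/context similarity, the sentence splitter is deliberately naive, and the calibration step is assumed to be a per-granularity sigmoid (Platt-style) mapping whose parameters `(a, b)` would in practice be fit on held-out data. The abstract does not specify the calibration or ensembling rule, so both are assumptions here.

```python
import math
import re
from typing import Callable, Dict, List, Sequence, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption: '.', '!' or '?' followed by space ends a sentence)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def segment(sentences: Sequence[str], size: int) -> List[str]:
    """Group sentences into non-overlapping segments of `size` sentences."""
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def overlap_score(query: str, passage: str) -> float:
    """Hypothetical stand-in for a DPR similarity: count of shared word types."""
    q = set(re.findall(r"\w+", query.lower()))
    p = set(re.findall(r"\w+", passage.lower()))
    return float(len(q & p))


def ensemble_retrieve(
    text: str,
    query: str,
    score_fn: Callable[[str, str], float] = overlap_score,
    granularities: Sequence[int] = (1, 5),
    calibration: Dict[int, Tuple[float, float]] = None,
) -> List[Tuple[float, int, str]]:
    """Score every segment at every granularity and rank by calibrated confidence.

    `calibration` maps a granularity to assumed sigmoid parameters (a, b);
    granularities without an entry fall back to (1.0, 0.0).
    Returns (confidence, granularity, segment) tuples, best first.
    """
    calibration = calibration or {}
    sentences = split_sentences(text)
    candidates: List[Tuple[float, int, str]] = []
    for size in granularities:
        a, b = calibration.get(size, (1.0, 0.0))
        for seg in segment(sentences, size):
            confidence = sigmoid(a * score_fn(query, seg) + b)
            candidates.append((confidence, size, seg))
    # Ensemble rule (assumed): take the most confident segment across all granularities.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates


if __name__ == "__main__":
    doc = (
        "Paris is the capital of France. Berlin is the capital of Germany. "
        "Tokyo is in Japan. Rome is in Italy."
    )
    ranked = ensemble_retrieve(
        doc,
        "capital of Germany",
        granularities=(1, 2),
        calibration={1: (1.0, 0.0), 2: (0.5, 0.0)},
    )
    for confidence, size, seg in ranked[:3]:
        print(f"{confidence:.3f}  size={size}  {seg}")
```

Calibrating per granularity matters because raw similarity scores from segments of different lengths are not directly comparable; without it, the ensemble would tend to favour whichever segmentation happens to produce larger raw scores.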