Augmenting pretrained language models with retrievers that select supporting documents has shown promise in solving common NLP problems, including language modeling and question answering, in an interpretable way. In this paper, we first study the strengths and weaknesses of different retriever-augmented language models (REALM, $k$NN-LM, FiD coupled with DPR, and ATLAS and Flan-T5 coupled with Contriever) in reasoning over the retrieved statements across different tasks. We show that the limitations of retrieve-then-read models in reasoning are rooted in both the retriever module and the language model. Our experimental results demonstrate that the similarity metric used by the retrievers is generally insufficient for reasoning tasks. Additionally, we show that the language models in retriever-augmented models do not take the complicated relations between the statements into account, which leads to poor reasoning performance even with larger models. Moreover, we analyze the reasoning performance of large language models using multihop retrieval, but we observe only minor improvements. Overall, these findings leave substantial room for further research in this area.
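The claim that a similarity metric can fall short on reasoning tasks can be illustrated with a toy sketch. The example below is not taken from the paper: it uses a simple bag-of-words cosine similarity as a stand-in for the dense retrievers named above (DPR, Contriever), and the query, the statements, and the helper names `bow`, `cosine`, and `retrieve` are hypothetical. It only shows the general failure mode, in which a statement needed for the second reasoning hop shares little surface content with the query and is ranked below a distractor.

```python
import math
from collections import Counter

# Hypothetical two-hop question: answering it needs statement 0 AND statement 1.
QUERY = "which country was alice born in"
STATEMENTS = [
    "alice was born in lyon",                 # hop 1 (needed)
    "lyon is a city in france",               # hop 2 (needed)
    "alice was born in winter",               # distractor, lexically close to the query
    "which country exports the most coffee",  # distractor
]


def bow(text):
    """Bag-of-words vector as a token -> count mapping."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, statements, k):
    """Return the indices of the k statements most similar to the query."""
    q = bow(query)
    ranked = sorted(
        range(len(statements)),
        key=lambda i: cosine(q, bow(statements[i])),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    top = retrieve(QUERY, STATEMENTS, k=2)
    for i in top:
        print(i, STATEMENTS[i])
```

In this toy setting, the top-2 retrieved statements are indices 0 and 2: the distractor outranks the second-hop statement (index 1), so a reader model would never see the fact that links Lyon to France. Dense retrievers score differently, so this is an illustration of the failure mode under the stated assumptions, not a reproduction of the reported results.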