Augmenting pretrained language models with retrievers has shown promise in effectively solving common NLP problems, such as language modeling and question answering. In this paper, we evaluate the strengths and weaknesses of popular retriever-augmented language models, namely kNN-LM, REALM, DPR + FiD, Contriever + ATLAS, and Contriever + Flan-T5, in reasoning over retrieved statements across different tasks. Our findings indicate that the simple similarity metric employed by retrievers is insufficient for retrieving all the statements necessary for reasoning. Additionally, the language models do not exhibit strong reasoning even when provided with only the required statements. When combined with imperfect retrievers, their performance becomes even worse; for example, Flan-T5's performance drops by 28.6% when retrieving 5 statements using Contriever. While larger language models improve performance, there is still substantial room for improvement. Our further analysis indicates that multihop retrieve-and-read is promising for large language models like GPT-3.5, but does not generalize to other language models like Flan-T5-xxl.