Prompting-based large language models (LLMs) are surprisingly powerful at generating natural language reasoning steps, or chains of thought (CoT), for multi-step question answering (QA). They struggle, however, when the necessary knowledge is either unavailable to the LLM or not up to date within its parameters. While using the question to retrieve relevant text from an external knowledge source helps LLMs, we observe that this one-step retrieve-and-read approach is insufficient for multi-step QA. Here, \textit{what to retrieve} depends on \textit{what has already been derived}, which in turn may depend on \textit{what was previously retrieved}. To address this, we propose IRCoT, a new approach for multi-step QA that interleaves retrieval with the steps (sentences) of a CoT, guiding the retrieval with the CoT and in turn using the retrieved results to improve the CoT. Using IRCoT with GPT3 substantially improves retrieval (up to 21 points) as well as downstream QA (up to 15 points) on four datasets: HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC. We observe similarly substantial gains in out-of-distribution (OOD) settings as well as with much smaller models such as Flan-T5-large without additional training. IRCoT reduces model hallucination, resulting in factually more accurate CoT reasoning. Code, data, and prompts are available at \url{https://github.com/stonybrooknlp/ircot}.
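The interleaving described in the abstract can be illustrated with a minimal sketch. It is not the released implementation: every name below (`retrieve`, `interleave`, `scripted_reasoner`) is hypothetical, the word-overlap retriever stands in for a real retriever, the scripted function stands in for an LLM call, and stopping on the phrase "answer is" is an assumed termination rule.

```python
import re
from typing import Callable, List, Tuple


def tokens(text: str) -> set:
    """Lowercased word set, used by the toy retriever below."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, corpus: List[str], k: int = 1) -> List[str]:
    """Toy stand-in for a retriever: rank paragraphs by word overlap with the query."""
    q = tokens(query)
    ranked = sorted(corpus, key=lambda p: len(q & tokens(p)), reverse=True)
    return ranked[:k]


def interleave(
    question: str,
    corpus: List[str],
    next_step: Callable[[str, List[str], List[str]], str],
    max_steps: int = 4,
    k: int = 1,
) -> Tuple[List[str], List[str]]:
    """Alternate between generating one CoT sentence and retrieving with it."""
    paragraphs = retrieve(question, corpus, k)  # initial retrieval uses the question
    cot: List[str] = []
    for _ in range(max_steps):
        # Reason step: produce the next CoT sentence from what has been collected.
        sentence = next_step(question, paragraphs, cot)
        cot.append(sentence)
        if "answer is" in sentence.lower():  # assumed termination rule
            break
        # Retrieve step: the newest CoT sentence becomes the query.
        for p in retrieve(sentence, corpus, k):
            if p not in paragraphs:
                paragraphs.append(p)
    return cot, paragraphs


def scripted_reasoner(question: str, paragraphs: List[str], cot: List[str]) -> str:
    """Hypothetical stand-in for an LLM: scripted, not a real model call."""
    if "Germany" in " ".join(paragraphs):
        return "Mack Rides is from Germany, so the answer is Germany."
    return "The manufacturer is Mack Rides."


corpus = [
    "Lost Gravity was manufactured by Mack Rides.",
    "Mack Rides is a company from Germany.",
    "Paris is the capital of France.",
]
question = "In what country was Lost Gravity manufactured?"

one_step = retrieve(question, corpus, k=1)  # one-step retrieve-and-read baseline
cot, paragraphs = interleave(question, corpus, scripted_reasoner)
print(one_step)
print(cot)
print(paragraphs)
```

In this toy run the question alone retrieves only the paragraph about the manufacturer. The paragraph naming the country is reached only after the first reasoning sentence is used as a query, which is the dependency between derivation and retrieval that the abstract describes.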