Answering Unseen Questions With Smaller Language Models Using Rationale Generation and Dense Retrieval

When provided with sufficient explanatory context, smaller Language Models have been shown to exhibit strong reasoning ability on challenging short-answer question-answering tasks where the questions are unseen in training. We evaluate two methods for further improvement in this setting. Both methods combine rationales generated by a larger Language Model with longer contexts created by a multi-hop dense retrieval system. The first method ($\textit{RR}$) trains a Rationale Ranking model to score both generated rationales and retrieved contexts for relevance and truthfulness. We then use the scores to derive combined contexts from both knowledge sources using a number of combination strategies. For the second method ($\textit{RATD}$), we train a smaller Reasoning model on retrieval-augmented training datasets so that it becomes proficient at using relevant information from longer text sequences that may be only partially evidential and frequently contain many irrelevant sentences. Generally, we find that both methods are effective, but that the $\textit{RATD}$ method is more straightforward to apply and produces the strongest results in the unseen setting on which we focus. Our single best Reasoning model, using only 440 million parameters, materially improves upon strong comparable prior baselines on unseen evaluation datasets (StrategyQA 58.9 $\rightarrow$ 61.7 acc., CommonsenseQA 63.6 $\rightarrow$ 72.7 acc., ARC-DA 31.6 $\rightarrow$ 52.1 F1, IIRC 25.5 $\rightarrow$ 27.3 F1), and a version that uses our prior knowledge of each type of question to select a context combination strategy does even better. Our proposed models also generally outperform direct prompts against much larger models (BLOOM 175B and StableVicuna 13B) in both few-shot chain-of-thought and few-shot answer-only settings.
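
To make the idea of score-based context combination concrete, the following is a minimal illustrative sketch and not the strategies actually evaluated in this work, which the abstract does not specify. It assumes a hypothetical Rationale Ranking score in [0, 1] for each knowledge source; the names `ScoredContext` and `combine_contexts` and the threshold of 0.5 are assumptions introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoredContext:
    """A candidate context with a hypothetical RR-style score in [0, 1]."""
    text: str
    score: float


def combine_contexts(rationale: ScoredContext,
                     retrieved: ScoredContext,
                     threshold: float = 0.5) -> str:
    """One illustrative combination strategy (an assumption, not the paper's):
    keep every source whose score clears the threshold; if neither does,
    fall back to the single higher-scoring source."""
    candidates = (rationale, retrieved)
    kept = [c for c in candidates if c.score >= threshold]
    if not kept:
        # Neither source looks reliable, so keep only the better of the two.
        kept = [max(candidates, key=lambda c: c.score)]
    # Rationale first, then retrieved context, joined into one input string.
    return " ".join(c.text for c in kept)
```

Under these assumptions, a high-scoring rationale paired with a low-scoring retrieved context yields the rationale alone, while two high-scoring sources are concatenated into one longer context for the Reasoning model.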
