When provided with sufficient explanatory context, smaller Language Models have been shown to exhibit strong reasoning ability on challenging short-answer question-answering tasks where the questions are unseen in training. We evaluate two methods for further improvement in this setting. Both methods combine rationales generated by a larger Language Model with longer contexts created by a multi-hop dense retrieval system. The first method ($\textit{RR}$) trains a Rationale Ranking model to score both generated rationales and retrieved contexts with respect to relevance and truthfulness. We then use these scores to derive combined contexts from both knowledge sources using a number of combination strategies. For the second method ($\textit{RATD}$), we utilise the retrieval-augmented training datasets developed by Hartill et al. (2023) to train a smaller Reasoning model so that it becomes proficient at utilising relevant information from longer text sequences that may be only partially evidential and frequently contain many irrelevant sentences. We find that both methods significantly improve results. Our single best Reasoning model materially improves upon strong comparable prior baselines on unseen evaluation datasets (StrategyQA 58.9 $\rightarrow$ 61.7 acc., CommonsenseQA 63.6 $\rightarrow$ 72.7 acc., ARC-DA 31.6 $\rightarrow$ 52.1 F1, IIRC 25.5 $\rightarrow$ 27.3 F1), and a version that uses our prior knowledge of each question type to select a context combination strategy performs better still. Our proposed models also generally outperform direct prompting of much larger models (BLOOM 175B and StableVicuna 13B) in both few-shot chain-of-thought and standard few-shot settings.