When provided with sufficient explanatory context, smaller Language Models have been shown to exhibit strong reasoning ability on challenging short-answer question-answering tasks whose questions are unseen in training. We evaluate two methods for further improvement in this setting. Both methods combine rationales generated by a larger Language Model with longer contexts created by a multi-hop dense retrieval system. The first method ($\textit{RR}$) trains a Rationale Ranking model to score both generated rationales and retrieved contexts for relevance and truthfulness. We then use these scores to derive combined contexts from both knowledge sources under several combination strategies. For the second method ($\textit{RATD}$), we train a smaller Reasoning model on retrieval-augmented training datasets so that it becomes proficient at utilising relevant information from longer text sequences that may be only partially evidential and frequently contain many irrelevant sentences. Overall, we find that both methods are effective, but that the $\textit{RATD}$ method is more straightforward to apply and produces the strongest results in the unseen setting on which we focus. Our single best Reasoning model, with only 440 million parameters, materially improves upon strong comparable prior baselines on unseen evaluation datasets (StrategyQA 58.9 $\rightarrow$ 61.7 acc., CommonsenseQA 63.6 $\rightarrow$ 72.7 acc., ARC-DA 31.6 $\rightarrow$ 52.1 F1, IIRC 25.5 $\rightarrow$ 27.3 F1), and a version that uses our prior knowledge of each question type to select a context combination strategy does even better. Our proposed models also generally outperform direct prompts against much larger models (BLOOM 175B and StableVicuna 13B) in both few-shot chain-of-thought and few-shot answer-only settings.
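To make the $\textit{RR}$ idea of combining scored knowledge sources concrete, the following is a minimal sketch of one possible combination strategy. It is not one of the strategies evaluated in this work: the `ScoredContext` type, the `combine_contexts` function, the score range, and the 0.5 threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoredContext:
    """A candidate context plus a score from a (hypothetical) ranking model."""
    text: str
    score: float  # assumed to lie in [0, 1]; higher means more relevant/truthful


def combine_contexts(rationale: ScoredContext,
                     retrieved: ScoredContext,
                     threshold: float = 0.5) -> str:
    """Illustrative strategy (an assumption, not the authors' method):
    keep every source scoring at or above the threshold; if neither
    qualifies, fall back to the single higher-scoring source."""
    candidates = (rationale, retrieved)
    kept = [c for c in candidates if c.score >= threshold]
    if not kept:
        # Neither source passed the threshold: use the better of the two.
        kept = [max(candidates, key=lambda c: c.score)]
    # The rationale, when kept, always precedes the retrieved context.
    return " ".join(c.text for c in kept)
```

A call such as `combine_contexts(ScoredContext("generated rationale", 0.9), ScoredContext("retrieved passage", 0.2))` would pass only the rationale to the Reasoning model, while two high-scoring sources would be concatenated.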