Answering Unseen Questions With Smaller Language Models Using Rationale Generation and Dense Retrieval

When provided with sufficient explanatory context, smaller Language Models have been shown to exhibit strong reasoning ability on challenging short-answer question-answering tasks where the questions are unseen in training. We evaluate two methods for further improvement in this setting. Both methods focus on combining rationales generated by a larger Language Model with longer contexts created from a multi-hop dense retrieval system. The first method ($\textit{RR}$) involves training a Rationale Ranking model to score both generated rationales and retrieved contexts with respect to relevance and truthfulness. We then use the scores to derive combined contexts from both knowledge sources using a number of combinatory strategies. For the second method ($\textit{RATD}$) we train a smaller Reasoning model using retrieval-augmented training datasets such that it becomes proficient at utilising relevant information from longer text sequences that may be only partially evidential and frequently contain many irrelevant sentences. Generally we find that both methods are effective but that the $\textit{RATD}$ method is more straightforward to apply and produces the strongest results in the unseen setting on which we focus. Our single best Reasoning model using only 440 million parameters materially improves upon strong comparable prior baselines for unseen evaluation datasets (StrategyQA 58.9 $\rightarrow$ 61.7 acc., CommonsenseQA 63.6 $\rightarrow$ 72.7 acc., ARC-DA 31.6 $\rightarrow$ 52.1 F1, IIRC 25.5 $\rightarrow$ 27.3 F1) and a version utilising our prior knowledge of each type of question in selecting a context combination strategy does even better. Our proposed models also generally outperform direct prompts against much larger models (BLOOM 175B and StableVicuna 13B) in both few-shot chain-of-thought and few-shot answer-only settings.
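As a rough illustration of the $\textit{RR}$ idea, the sketch below shows one way scores from a Rationale Ranking model could drive a context combination strategy. The abstract does not spell out the actual strategies, so the threshold rule, the `ScoredContext` type, and the `combine_contexts` function are all hypothetical and are not the authors' implementation. Scores are assumed to lie in [0, 1], with higher meaning more relevant and truthful.

```python
from dataclasses import dataclass


@dataclass
class ScoredContext:
    """A candidate context plus a hypothetical RR model score in [0, 1]."""
    text: str
    score: float


def combine_contexts(rationale: ScoredContext,
                     retrieved: ScoredContext,
                     threshold: float = 0.5) -> str:
    """Build one combined context from the two knowledge sources.

    Assumed strategy (illustrative only):
      - if exactly one source scores at or above the threshold, use it alone;
      - otherwise concatenate both, placing the higher-scored one first.
    """
    rationale_ok = rationale.score >= threshold
    retrieved_ok = retrieved.score >= threshold

    if rationale_ok and not retrieved_ok:
        return rationale.text
    if retrieved_ok and not rationale_ok:
        return retrieved.text

    # Both pass or both fail: keep both, ordered by descending score.
    first, second = sorted([rationale, retrieved],
                           key=lambda c: c.score, reverse=True)
    return first.text + " " + second.text


if __name__ == "__main__":
    rationale = ScoredContext("Generated rationale text.", 0.9)
    retrieved = ScoredContext("Retrieved paragraph text.", 0.2)
    print(combine_contexts(rationale, retrieved))
```

The combined string would then be passed to the smaller Reasoning model as its context. A different strategy (for example, always concatenating) could be substituted per question type, which matches the abstract's remark that choosing the strategy with prior knowledge of the question type helps further.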

