We equip a smaller Language Model to generalise to answering challenging compositional questions that have not been seen in training. To do so, we propose a combination of multitask supervised pretraining on up to 93 tasks designed to instill diverse reasoning abilities, and a dense retrieval system that aims to retrieve a set of evidential paragraph fragments. Recent progress in question-answering has been achieved either through prompting methods against very large pretrained Language Models in zero-shot or few-shot fashion, or by fine-tuning smaller models, sometimes in conjunction with information retrieval. We focus on the less explored question of the extent to which zero-shot generalisation can be enabled in smaller models with retrieval against a corpus that may not contain sufficient information to answer a particular question. We establish strong baselines in this setting for diverse evaluation datasets (StrategyQA, CommonsenseQA, IIRC, DROP, Musique and ARC-DA), and show that performance can be significantly improved by adding retrieval-augmented training datasets designed to expose our models to a variety of heuristic reasoning strategies, such as weighing partial evidence or ignoring an irrelevant context.