We equip a smaller language model to generalise to answering challenging compositional questions that have not been seen in training. To do so, we propose a combination of multitask supervised pretraining on up to 93 tasks designed to instill diverse reasoning abilities, and a dense retrieval system that aims to retrieve a set of evidential paragraph fragments. Recent progress in question answering has been achieved either by prompting very large pretrained language models in zero-shot or few-shot fashion, or by fine-tuning smaller models, sometimes in conjunction with information retrieval. We focus on the less explored question of the extent to which zero-shot generalisation can be enabled in smaller models with retrieval against a corpus that may not contain sufficient information to answer a particular question. We establish strong baselines in this setting for diverse evaluation datasets (StrategyQA, CommonsenseQA, IIRC, DROP, Musique and ARC-DA), and show that performance can be significantly improved by adding retrieval-augmented training datasets, which are designed to expose our models to a variety of heuristic reasoning strategies such as weighing partial evidence or ignoring an irrelevant context.