In recent years, large language models (LLMs) have shown remarkable capabilities at scale, particularly in generating text conditioned on a prompt. In this work, we investigate the use of LLMs to augment the training data of small language models~(SLMs) with automatically generated counterfactual~(CF) instances -- i.e., minimally altered inputs -- in order to improve the out-of-domain~(OOD) performance of SLMs in the extractive question answering~(QA) setup. We show that, across various LLM generators, such data augmentation consistently enhances OOD performance and improves model calibration for both confidence-based and rationale-augmented calibrator models. Furthermore, these performance improvements correlate with higher diversity of CF instances in terms of their surface form and semantic content. Finally, we show that CF-augmented models that are easier to calibrate also exhibit much lower entropy when assigning importance, indicating that rationale-augmented calibrators prefer concise explanations.