In recent years, large language models (LLMs) have shown remarkable capabilities at scale, particularly in generating text conditioned on a prompt. In this work, we investigate the use of LLMs to augment the training data of small language models~(SLMs) with automatically generated counterfactual~(CF) instances -- i.e., minimally altered inputs -- in order to improve the out-of-domain~(OOD) performance of SLMs in the extractive question answering~(QA) setup. We show that, across various LLM generators, such data augmentation consistently enhances OOD performance and improves model calibration for both confidence-based and rationale-augmented calibrator models. Furthermore, these performance improvements correlate with higher diversity of CF instances in terms of their surface form and semantic content. Finally, we show that CF-augmented models which are easier to calibrate also exhibit much lower entropy when assigning importance, indicating that rationale-augmented calibrators prefer concise explanations.
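To make the entropy claim concrete, the following is a minimal sketch of how the entropy of an importance assignment could be measured. The abstract does not specify the attribution method or the normalization, so the function name `importance_entropy`, the use of absolute scores normalized to sum to one, and the example scores are all illustrative assumptions rather than the procedure used in this work.

```python
import math


def importance_entropy(scores):
    """Shannon entropy (in nats) of a vector of importance scores.

    Assumption: scores are turned into a distribution by taking absolute
    values and dividing by their sum. The abstract does not state which
    normalization is actually used.
    """
    magnitudes = [abs(s) for s in scores]
    total = sum(magnitudes)
    if total == 0:
        raise ValueError("at least one importance score must be non-zero")
    probs = [m / total for m in magnitudes]
    # Zero-probability entries contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical importance scores over four input tokens.
concentrated = importance_entropy([0.9, 0.05, 0.03, 0.02])  # mass on one token
diffuse = importance_entropy([0.25, 0.25, 0.25, 0.25])      # mass spread evenly
```

Under this assumed definition, an explanation that concentrates importance on a few tokens yields lower entropy than one that spreads importance evenly, which is the sense in which a low-entropy explanation can be read as a concise one.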