We investigate the robustness of fine-tuned Large Language Models (LLMs) for the task of Natural Language Inference (NLI), finding that the in-distribution gains from fine-tuning come at the cost of a large drop in out-of-distribution (OOD) performance. Despite the widespread use of closed-source LLMs, no robustness mitigation methods work under their API fine-tuning constraints. Existing methods for improving robustness typically require changing the fine-tuning process or performing large-scale data augmentation, approaches that are infeasible or cost-prohibitive for closed-source models. To address this, we propose strategically selecting the NLI fine-tuning data, prioritising more complex examples or replacing existing training examples with LLM-generated data. Prioritising more complex training examples improves performance on challenging OOD NLI datasets, while training with synthetic data leads to substantial improvements on easier OOD datasets. We find that synthetic examples are often too simple, and that by prompting LLMs to create more complex synthetic data we can improve performance on both easy and challenging OOD datasets. Finally, we show that recent autoregressive LLMs are substantially more robust to distributional shifts than encoder models, and should be the preferred baseline for future research.