Instruction fine-tuning has recently emerged as a promising approach for improving the zero-shot capabilities of Large Language Models (LLMs) on new tasks. This technique has shown particular strength in improving the performance of modestly sized LLMs, sometimes inducing performance competitive with much larger model variants. In this paper we ask two questions: (1) how sensitive are instruction-tuned models to the particular phrasings of instructions, and (2) how can we make them more robust to such natural language variation? To answer the former, we collect a set of 319 instructions manually written by NLP practitioners for over 80 unique tasks included in widely used benchmarks, and we compare the variance and average performance obtained with these instructions against those obtained with instruction phrasings observed during instruction fine-tuning. We find that using novel (unobserved) but appropriate instruction phrasings consistently degrades model performance, sometimes substantially so. Further, such natural instructions yield a wide variance in downstream performance, despite their semantic equivalence. Put another way, instruction-tuned models are not especially robust to instruction re-phrasings. To mitigate this issue, we propose a simple method that introduces "soft prompt" embedding parameters and optimizes them to maximize the similarity between representations of semantically equivalent instructions. We show that this method consistently improves the robustness of instruction-tuned models.
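The soft-prompt idea described in the abstract can be illustrated with a minimal sketch under strong assumptions. The frozen model is replaced by a hypothetical toy encoder (mean pooling followed by a fixed tanh layer), similarity is taken to be cosine similarity, and gradients are approximated by finite differences with a backtracking step. None of these choices come from the paper, and every name below (`encode`, `alignment_loss`, `train_soft_prompt`, `make_toy_data`) is illustrative only.

```python
import math
import random


def encode(soft_prompt, token_vecs, weights):
    """Toy frozen encoder (assumption): mean-pool prompt + tokens, then tanh(W x)."""
    rows = soft_prompt + token_vecs
    dim = len(rows[0])
    pooled = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    return [math.tanh(sum(w * x for w, x in zip(w_row, pooled))) for w_row in weights]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def alignment_loss(soft_prompt, pairs, weights):
    """Mean (1 - cosine) over pairs of instructions treated as equivalent."""
    total = 0.0
    for a, b in pairs:
        rep_a = encode(soft_prompt, a, weights)
        rep_b = encode(soft_prompt, b, weights)
        total += 1.0 - cosine(rep_a, rep_b)
    return total / len(pairs)


def train_soft_prompt(init_prompt, pairs, weights, steps=30, lr=0.5, eps=1e-4):
    """Update only the soft prompt; the encoder weights stay frozen."""
    prompt = [row[:] for row in init_prompt]  # do not mutate the caller's list
    for _ in range(steps):
        base = alignment_loss(prompt, pairs, weights)
        # Central finite-difference gradient with respect to each prompt entry.
        grads = [[0.0] * len(row) for row in prompt]
        for i, row in enumerate(prompt):
            for j, old in enumerate(row):
                row[j] = old + eps
                up = alignment_loss(prompt, pairs, weights)
                row[j] = old - eps
                down = alignment_loss(prompt, pairs, weights)
                row[j] = old
                grads[i][j] = (up - down) / (2 * eps)
        # Backtracking: accept a step only if it lowers the loss.
        step = lr
        while step > 1e-6:
            candidate = [[p - step * g for p, g in zip(p_row, g_row)]
                         for p_row, g_row in zip(prompt, grads)]
            if alignment_loss(candidate, pairs, weights) < base:
                prompt = candidate
                break
            step /= 2
        else:
            break  # no improving step found; stop early
    return prompt


def make_toy_data(dim=4, n_pairs=3, seed=1):
    """Random stand-ins for token embeddings and encoder weights (not real data)."""
    rng = random.Random(seed)
    vec = lambda: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    weights = [vec() for _ in range(dim)]
    pairs = [([vec() for _ in range(4)], [vec() for _ in range(5)])
             for _ in range(n_pairs)]
    return pairs, weights


if __name__ == "__main__":
    pairs, weights = make_toy_data()
    init = [[0.05, -0.05, 0.02, 0.01], [-0.03, 0.04, 0.0, 0.02]]
    trained = train_soft_prompt(init, pairs, weights)
    print("loss before:", alignment_loss(init, pairs, weights))
    print("loss after: ", alignment_loss(trained, pairs, weights))
```

In a real setup the soft prompt would presumably be trained through automatic differentiation alongside a task loss. Optimized on its own, a similarity objective like this one can be satisfied trivially by a prompt that dominates every pooled representation, so the sketch shows only the shape of the objective, not a usable training recipe.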