Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

As large language models (LLMs) are adopted as a fundamental component of language technologies, it is crucial to accurately characterize their performance. Because choices in prompt design can strongly influence model behavior, this design process is critical to using any modern pre-trained generative language model effectively. In this work, we focus on LLM sensitivity to a quintessential class of meaning-preserving design choices: prompt formatting. We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points when evaluated using LLaMA-2-13B. Sensitivity remains even when increasing model size or the number of few-shot examples, or when performing instruction tuning. Our analysis suggests that work evaluating LLMs with prompting-based methods would benefit from reporting a range of performance across plausible prompt formats, instead of the currently standard practice of reporting performance on a single format. We also show that format performance only weakly correlates between models, which calls into question the methodological validity of comparing models with an arbitrarily chosen, fixed prompt format. To facilitate systematic analysis, we propose FormatSpread, an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task and reports the interval of expected performance without accessing model weights. Furthermore, we present a suite of analyses that characterize the nature of this sensitivity, including exploring the influence of particular atomic perturbations and the internal representation of particular formats.
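To make the idea of reporting a performance range concrete, the following is a minimal sketch and not the FormatSpread algorithm itself: it exhaustively scores every format in a small set, whereas the abstract describes FormatSpread as rapidly evaluating a sampled set. All names here (`make_formats`, `accuracy`, `performance_spread`, `toy_model`) are hypothetical, the "Question/Answer" template is an assumed example of meaning-preserving variation, and `toy_model` is a deliberately brittle stand-in for a black-box LLM call.

```python
from itertools import product


def make_formats(separators, spacings, casings):
    """Enumerate meaning-preserving variants of a 'Question/Answer' template.

    Each returned callable renders a prompt; only separators, spacing and
    casing of the field names change, never the task content.
    """
    formats = []
    for sep, space, case in product(separators, spacings, casings):
        # Bind loop variables as defaults so each closure keeps its own values.
        def fmt(question, answer="", sep=sep, space=space, case=case):
            return (
                f"{case('Question')}{sep}{space}{question}\n"
                f"{case('Answer')}{sep}{space}{answer}"
            )
        formats.append(fmt)
    return formats


def accuracy(model, fmt, examples):
    """Fraction of examples the model answers correctly under one format."""
    correct = sum(model(fmt(question)) == gold for question, gold in examples)
    return correct / len(examples)


def performance_spread(model, formats, examples):
    """Return (lowest, highest) accuracy observed across all formats."""
    scores = [accuracy(model, fmt, examples) for fmt in formats]
    return min(scores), max(scores)


def toy_model(prompt):
    """Hypothetical stand-in for an LLM: it only parses prompts that use ': '."""
    first_line = prompt.splitlines()[0]
    if ": " not in first_line:
        return "?"
    left, right = first_line.split(": ", 1)[1].split("+")
    return str(int(left) + int(right))


formats = make_formats(
    separators=[":", " -"],
    spacings=[" ", "\t"],
    casings=[str.title, str.upper],
)
examples = [("1+2", "3"), ("2+2", "4")]
low, high = performance_spread(toy_model, formats, examples)
print(f"accuracy range across {len(formats)} formats: {low:.2f} to {high:.2f}")
```

Only black-box calls to `model` are needed, which mirrors the abstract's point that the interval can be reported without accessing model weights. With a real model, `toy_model` would be replaced by an inference call and the set of formats would be sampled rather than enumerated.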
