Large language models (LLMs) are widely used as zero-shot and few-shot classifiers, with task behaviour largely controlled through prompting. A growing body of work has observed that LLMs are sensitive to prompt variations, with small changes leading to large swings in performance. However, in many cases, sensitivity is investigated using underspecified prompts that provide minimal task instructions and only weakly constrain the model's output space. In this work, we argue that a significant portion of the observed prompt sensitivity can be attributed to prompt underspecification. We systematically study and compare the sensitivity of underspecified prompts and prompts that provide specific instructions. Using performance analysis, logit analysis, and linear probing, we find that underspecified prompts exhibit higher performance variance and lower logit values for relevant tokens, whereas instruction-based prompts suffer less from these problems. However, linear probing suggests that prompt underspecification has only a marginal impact on internal LLM representations, with its effects instead emerging in the final layers. Overall, our findings highlight the need for greater rigour when investigating and mitigating prompt sensitivity.