The latest generation of LLMs can be prompted to achieve impressive zero-shot or few-shot performance in many NLP tasks. However, since performance is highly sensitive to the choice of prompt, considerable effort has been devoted to crowd-sourcing prompts or designing methods for prompt optimisation. Yet, we still lack a systematic understanding of how linguistic properties of prompts correlate with task performance. In this work, we investigate how LLMs of different sizes, both pre-trained and instruction-tuned, perform on prompts that are semantically equivalent but vary in linguistic structure. We investigate both grammatical properties, such as mood, tense, aspect and modality, and lexico-semantic variation through the use of synonyms. Our findings contradict the common assumption that LLMs achieve optimal performance on lower-perplexity prompts that reflect language use in pretraining or instruction-tuning data. Prompts transfer poorly between datasets or models, and performance cannot generally be explained by perplexity, word frequency, ambiguity or prompt length. Based on our results, we put forward a proposal for a more robust and comprehensive evaluation standard for prompting research.