Finding the best way to adapt pre-trained language models to a task is a major challenge in current NLP. Like the previous generation of task-tuned (TT) models, models adapted to tasks via in-context learning (ICL) are robust in some setups but not in others. Here, we present a detailed analysis of which design choices cause instabilities and inconsistencies in LLM predictions. First, we show that spurious correlations between input distributions and labels, a known issue in TT models, pose only a minor problem for prompted models. Then, we conduct a systematic, holistic evaluation of different factors that have been found to influence predictions in a prompting setup. We test all possible combinations of a range of factors on both vanilla and instruction-tuned (IT) LLMs of different scales and statistically analyse the results to show which factors are the most influential, interactive or stable. Our results show which factors can be used without precautions and which should be avoided or handled with care in most settings.