Large language models demonstrate a remarkable capability for learning to solve new tasks from a few examples. The prompt template, that is, the way the input examples are formatted to obtain the prompt, is an important yet often overlooked aspect of in-context learning. In this work, we conduct a comprehensive study of the template format's influence on in-context learning performance. We evaluate the impact of the prompt template across models (from 770M to 70B parameters) and 4 standard classification datasets. We show that a poor choice of template can reduce the performance of the strongest models and inference methods to the level of random guessing. More importantly, the best templates do not transfer between different setups, or even between models of the same family. Our findings show that the currently prevalent approach to evaluation, which ignores template selection, may give misleading results because different works use different templates. As a first step towards mitigating this issue, we propose Template Ensembles, which aggregate model predictions across several templates. This simple test-time augmentation boosts average performance while being robust to the choice of a random set of templates.
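To make the idea of aggregating predictions across templates concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state the aggregation rule, so averaging per-class probabilities is an assumption here. The names `template_ensemble`, `toy_predict_proba`, and `TOY_OUTPUTS`, as well as the template strings and probability values, are hypothetical; the toy function stands in for a real language model call.

```python
from typing import Callable, List, Sequence

# Hypothetical per-template class probabilities for a 2-class task.
# In practice these would come from a language model; the numbers are made up.
TOY_OUTPUTS = {
    "input: {text}\noutput:": [0.25, 0.75],
    "Review: {text}\nSentiment:": [0.5, 0.5],
    "{text} It was": [0.0, 1.0],
}


def toy_predict_proba(template: str, text: str) -> Sequence[float]:
    """Stand-in for a model call: returns fixed probabilities per template."""
    _prompt = template.format(text=text)  # a real model would consume this prompt
    return TOY_OUTPUTS[template]


def template_ensemble(
    predict_proba: Callable[[str, str], Sequence[float]],
    templates: Sequence[str],
    text: str,
) -> List[float]:
    """Average class probabilities over templates (assumed aggregation rule)."""
    if not templates:
        raise ValueError("at least one template is required")
    per_template = [predict_proba(t, text) for t in templates]
    n_classes = len(per_template[0])
    return [
        sum(probs[c] for probs in per_template) / len(per_template)
        for c in range(n_classes)
    ]


if __name__ == "__main__":
    probs = template_ensemble(toy_predict_proba, list(TOY_OUTPUTS), "great movie")
    predicted_class = max(range(len(probs)), key=probs.__getitem__)
    print(probs, predicted_class)
```

In this sketch a single template can give an uninformative prediction (the second one returns equal probabilities), while the averaged prediction still favours one class. Other aggregation rules, such as majority voting over per-template labels, would fit the same interface.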