Large language models demonstrate a remarkable capability for learning to solve new tasks from a few examples. The prompt template, that is, the way the input examples are formatted to obtain the prompt, is an important yet often overlooked aspect of in-context learning. In this work, we conduct a comprehensive study of the template format's influence on in-context learning performance. We evaluate the impact of the prompt template across models ranging from 770M to 70B parameters and four standard classification datasets. We show that a poor choice of template can reduce the performance of even the strongest models and inference methods to random-guess level. More importantly, the best templates do not transfer between different setups, or even between models of the same family. Our findings show that the currently prevalent approach to evaluation, which ignores template selection, may give misleading results because different works use different templates. As a first step towards mitigating this issue, we propose Template Ensembles, which aggregate model predictions across several templates. This simple test-time augmentation boosts average performance while being robust to the choice of a random set of templates.
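To make the idea of aggregating predictions across templates concrete, here is a minimal sketch. The abstract does not state the aggregation rule, so averaging class probabilities is an assumption made for illustration; the function names `average_probabilities` and `ensemble_predict` and the example numbers are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def average_probabilities(per_template_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average class-probability vectors, one vector per prompt template.

    Assumption: the ensemble aggregates by a plain arithmetic mean.
    """
    if not per_template_probs:
        raise ValueError("need predictions from at least one template")
    num_classes = len(per_template_probs[0])
    if any(len(p) != num_classes for p in per_template_probs):
        raise ValueError("all probability vectors must have the same length")
    n = len(per_template_probs)
    return [sum(p[c] for p in per_template_probs) / n for c in range(num_classes)]


def ensemble_predict(per_template_probs: Sequence[Sequence[float]]) -> int:
    """Return the index of the class with the highest averaged probability."""
    avg = average_probabilities(per_template_probs)
    return max(range(len(avg)), key=avg.__getitem__)


# Hypothetical probabilities for one test input under three templates
# (two classes). Illustrative values only, not results from the paper.
probs = [
    [0.9, 0.1],  # template A
    [0.4, 0.6],  # template B
    [0.3, 0.7],  # template C
]
label = ensemble_predict(probs)
```

In this sketch, each inner list would come from running the same model on the same test input formatted with a different template. Averaging probabilities rather than taking a majority vote lets a confident prediction outweigh several uncertain ones, which is one possible design choice among several.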