Methods such as chain-of-thought prompting and self-consistency have pushed the frontier of language model reasoning performance with no additional training. To further improve performance, we propose a prompt ensembling method for large language models, which uses a small dataset to construct a set of few-shot prompts that together comprise a ``boosted prompt ensemble''. The few-shot examples for each prompt are chosen in a stepwise fashion to be ``hard'' examples on which the previous step's ensemble is uncertain. We show that this approach outperforms single-prompt output-space ensembles and bagged prompt-space ensembles on the GSM8k and AQuA datasets, among others. We propose both train-time and test-time versions of boosted prompting, which use different levels of available annotation, and we conduct a detailed empirical study of our algorithm.
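The stepwise selection described above can be illustrated with a minimal sketch. It is one possible reading of the abstract, not the authors' implementation: it assumes that "uncertain" means low agreement among the answers sampled from the current ensemble, that each new prompt is built from the lowest-agreement training examples, and that final predictions are a majority vote over all prompts. The names `boost_prompts`, `predict`, `agreement`, and the `sample` callable (a stand-in for a language model call that returns sampled answers) are hypothetical, and reasoning paths are omitted for brevity.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]                     # (question, answer)
Prompt = Tuple[Example, ...]                  # a few-shot prompt is a tuple of examples
Sampler = Callable[[Prompt, str], List[str]]  # hypothetical LLM call: sampled answers


def ensemble_votes(prompts: Sequence[Prompt], question: str, sample: Sampler) -> Counter:
    """Pool the answers sampled from every prompt in the ensemble."""
    votes: Counter = Counter()
    for prompt in prompts:
        votes.update(sample(prompt, question))
    return votes


def agreement(votes: Counter) -> float:
    """Fraction of pooled answers that match the most common answer."""
    total = sum(votes.values())
    return votes.most_common(1)[0][1] / total if total else 0.0


def boost_prompts(initial: Prompt, train: Sequence[Example], sample: Sampler,
                  rounds: int, shots: int) -> List[Prompt]:
    """Grow the ensemble one prompt per round from low-agreement examples."""
    prompts: List[Prompt] = [initial]
    for _ in range(rounds):
        used = {ex for prompt in prompts for ex in prompt}
        scored = []
        for example in train:
            if example in used:
                continue  # do not reuse an example already in some prompt
            votes = ensemble_votes(prompts, example[0], sample)
            scored.append((agreement(votes), example))
        if not scored:
            break
        # Assumption: "hard" means the current ensemble agrees least on it.
        scored.sort(key=lambda pair: pair[0])
        prompts.append(tuple(example for _, example in scored[:shots]))
    return prompts


def predict(prompts: Sequence[Prompt], question: str, sample: Sampler) -> str:
    """Majority vote over the answers sampled from all prompts."""
    return ensemble_votes(prompts, question, sample).most_common(1)[0][0]
```

In this sketch, selection relies only on ensemble agreement, so it needs no labels to rank examples; a variant with more annotation could additionally require that the majority answer match a gold label before an example is admitted to a prompt.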