BatchPrompt: Accomplish more with less

As the ever-increasing token limits of large language models (LLMs) have enabled long contexts as input, prompting with a single data sample at a time may no longer be efficient. A straightforward strategy to improve efficiency is to batch data within the token limit (e.g., 8k for gpt-3.5-turbo; 32k for GPT-4), which we call BatchPrompt. We make two initial observations about prompting with batched data. First, we find that prompting with batched data in longer contexts will inevitably lead to worse performance compared to single-data prompting. Second, the performance of the language model is significantly correlated with the positions and order of the batched data, due to the corresponding change in decoder context. To retain efficiency and overcome the performance loss, we propose Batch Permutation and Ensembling (BPE) and a novel Self-reflection-guided EArly Stopping (SEAS) technique. Our comprehensive experimental evaluation demonstrates that BPE can boost the performance of BatchPrompt by a striking margin on a range of popular NLP tasks, including question answering (BoolQ), textual entailment (RTE), and duplicate question identification (QQP). These results are competitive with, or even higher than, single-data prompting (SinglePrompt), while BatchPrompt requires far fewer LLM calls and input tokens. Comparing SinglePrompt with BatchPrompt at batch size 32, BatchPrompt uses just 9%-16% of the LLM calls: BoolQ accuracy goes from 90.6% to 90.9% with 27.4% of the tokens, QQP accuracy from 87.2% to 88.4% with 18.6% of the tokens, and RTE accuracy from 91.5% to 91.1% with 30.8% of the tokens. To the best of our knowledge, this is the first work to technically improve the prompting efficiency of large language models. We hope our simple yet effective approach will shed light on future research on large language models. The code will be released.
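
The permutation-and-ensembling idea can be illustrated with a minimal sketch. The abstract does not say how the rounds are combined, so this sketch assumes a plain majority vote per item. The names `batch_permutation_ensemble`, `label_batch`, and `num_rounds` are hypothetical and do not come from the paper's code; `label_batch` stands in for a single LLM call that returns one answer per batched item, in the order given. SEAS is not sketched here because the abstract gives no detail on it.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def batch_permutation_ensemble(
    items: Sequence[str],
    label_batch: Callable[[List[str]], List[str]],
    num_rounds: int = 5,
    seed: int = 0,
) -> List[str]:
    """Label one batch several times in shuffled order and majority-vote.

    Assumption: ensembling is a per-item majority vote over rounds.
    `label_batch` is a hypothetical stand-in for one batched LLM call.
    """
    if num_rounds < 1:
        raise ValueError("num_rounds must be at least 1")

    rng = random.Random(seed)
    votes = [Counter() for _ in items]  # one vote tally per original item

    for _ in range(num_rounds):
        # A fresh permutation moves each item to a different position,
        # so position-dependent errors are less likely to repeat.
        order = list(range(len(items)))
        rng.shuffle(order)

        answers = label_batch([items[i] for i in order])
        if len(answers) != len(order):
            raise ValueError("label_batch must return one answer per item")

        # Map each answer back to the item's original index.
        for position, item_index in enumerate(order):
            votes[item_index][answers[position]] += 1

    # The most frequent answer across rounds wins for each item.
    return [tally.most_common(1)[0][0] for tally in votes]
```

With a batch of 32 items and 5 rounds, this makes 5 calls instead of 32, which is roughly in line with the 9%-16% call budget quoted above; the actual number of rounds used in the paper is not stated in the abstract.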
