Large pre-trained language models (PLMs) have shown promising in-context learning abilities. However, because of their backbone transformer architecture, existing PLMs are bottlenecked by memory and computational cost when scaling up to large context sizes, leaving instruction tuning and in-context learning with many demonstration examples, as well as long-range language modeling, under-explored. In this study, we propose EVALM, a long-range language model based on an efficient transformer mechanism. EVALM is trained with 8k tokens per batch line and can be tested on contexts of up to 256k tokens with extrapolation, 128 times the limit of existing PLMs (e.g. GPT3). Based on EVALM, we efficiently scale up the number of examples in both instruction tuning and in-context learning to explore the boundary of the benefits from more annotated data. Experimental results on a diverse set of tasks show that EVALM achieves 4.1% higher accuracy on average, and that the context length at which tasks reach their best accuracy averages around 12k. We find that in-context learning achieves higher performance with more demonstrations under many-shot instruction tuning (8k), and that extending the length of instructions to 16k further raises the upper bound of scaling in-context learning.