One of the prominent issues stifling the current generation of large language models is their limited context length. Recent proprietary models such as GPT-4 and Claude 2 have introduced longer context lengths of 8k/32k and 100k tokens, respectively; however, despite efforts in the community, most common open models, such as LLaMA-2, have a context length of 4k or less. Unlimiformer (Bertsch et al., 2023) is a recently popular vector-retrieval augmentation method that offloads cross-attention computations to a kNN index. However, its main limitation is that it is incompatible with decoder-only transformers out of the box. In this work, we explore practical considerations of adapting Unlimiformer to decoder-only transformers and introduce a series of modifications to overcome this limitation. Moreover, we expand the original experimental setup on summarization to include a new task (i.e., free-form Q&A) and an instruction-tuned model (i.e., a custom 6.7B GPT model). Our results show the effectiveness of these modifications on summarization, performing on par with a model with 2x the context length. Finally, we discuss limitations and future directions for free-form Q&A and instruction-tuned models.