In-context learning (ICL) approaches typically condition decoder-only language model generation on reference information via prompting. Processing a context just-in-time is inefficient due to the quadratic cost of self-attention, so caching is desirable. However, caching transformer states can easily require almost as much space as the model parameters, and when the right context is not known in advance, caching it for ICL can be challenging. This work addresses these limitations by introducing models that, inspired by the encoder-decoder architecture, use cross-attention to condition generation on reference text without including it in the prompt. More precisely, we leverage pre-trained decoder-only models and train only a small number of added layers. We use Question-Answering (QA) as a testbed to evaluate the ability of our models to perform conditional generation, and observe that they outperform ICL, are comparable to fine-tuned prompted LLMs, and reduce the space footprint relative to standard KV caching by two orders of magnitude.
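The core idea above can be illustrated with a minimal sketch: decoder hidden states act as queries that cross-attend over a pre-computed (cacheable) representation of the reference text, so the reference never has to occupy prompt tokens. This is a simplified single-head toy, not the paper's implementation; all names and dimensions here are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Single-head cross-attention (illustrative sketch).

    queries: decoder hidden states, one vector per generated position.
    keys/values: a cached representation of the reference text; in the
    paper's setting this cache would be computed once and reused,
    instead of re-encoding the reference in every prompt.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product scores of this decoder state against the cache.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Output is a convex combination of the cached value vectors.
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# Toy usage: two decoder positions attend over a three-vector context cache.
H = [[1.0, 0.0], [0.0, 1.0]]              # decoder hidden states (queries)
C = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # cached reference states (keys = values)
print(cross_attention(H, C, C))
```

In the paper's setup, only the added cross-attention layers would be trained while the pre-trained decoder-only backbone stays frozen; the cached `C` plays the role of the compact context representation that replaces the full KV cache.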