Prompt-tuning is an emerging strategy to adapt large language models (LLMs) to downstream tasks by learning a (soft-)prompt parameter from data. Despite its success in LLMs, there is limited theoretical understanding of the power of prompt-tuning and the role of the attention mechanism in prompting. In this work, we explore prompt-tuning for one-layer attention architectures and study contextual mixture-models where each input token belongs to a context-relevant or context-irrelevant set. We isolate the role of prompt-tuning through a self-contained prompt-attention model. Our contributions are as follows: (1) We show that softmax-prompt-attention is provably more expressive than softmax-self-attention and linear-prompt-attention under our contextual data model. (2) We analyze the initial trajectory of gradient descent, show that it learns the prompt and prediction head with near-optimal sample complexity, and demonstrate how the prompt can provably attend to sparse context-relevant tokens. (3) Assuming a known prompt but an unknown prediction head, we characterize the exact finite-sample performance of prompt-attention, which reveals the fundamental performance limits and the precise benefit of the context information. We also provide experiments that verify our theoretical insights on real datasets and demonstrate how prompt-tuning enables the model to attend to context-relevant information.
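As a rough illustration of the setup described above, the following sketch builds one sequence from a toy contextual mixture and scores it with a softmax prompt-attention layer. The functional form f(X) = ⟨w, Xᵀ softmax(Xq)⟩, the data generator, and all names (`sample_sequence`, `prompt_attention`, `q_star`, `w_star`, the noise level and sequence length) are assumptions made for this example; the abstract does not specify them.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def sample_sequence(rng, q_star, w_star, T=20, n_relevant=4, noise_std=0.1):
    # Hypothetical toy data model: a few context-relevant tokens carry
    # q_star + y * w_star; the remaining tokens are small Gaussian noise.
    d = q_star.shape[0]
    y = int(rng.choice([-1, 1]))
    X = noise_std * rng.standard_normal((T, d))
    relevant = rng.choice(T, size=n_relevant, replace=False)
    X[relevant] = q_star + y * w_star
    return X, y, relevant

def prompt_attention(X, q, w):
    # Assumed form: the prompt q scores each token, softmax turns the scores
    # into attention weights, and the head w reads the attended average.
    attn = softmax(X @ q)
    return float(w @ (X.T @ attn)), attn

rng = np.random.default_rng(0)
d = 16
q_star, w_star = np.eye(d)[0], np.eye(d)[1]  # orthonormal toy directions
X, y, relevant = sample_sequence(rng, q_star, w_star)

# A prompt aligned with q_star (scaled up) should concentrate attention
# on the sparse context-relevant tokens.
score, attn = prompt_attention(X, 10.0 * q_star, w_star)
print("label:", y)
print("score:", round(score, 3))
print("attention mass on relevant tokens:", round(float(attn[relevant].sum()), 3))
```

In this toy setting the prompt acts as a selector: scaling a prompt aligned with `q_star` pushes the softmax weights toward the relevant tokens, so the head `w_star` mostly sees the label-carrying component.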