Large language models (LLMs) exhibit an emergent in-context learning (ICL) ability. However, ICL models that solve ordinary cases are hard to extend to more complex tasks when they process the demonstration examples only once. This single-turn ICL is inconsistent with the human decision-making process of learning by analogy. In this paper, we propose an effective and efficient two-stage framework that boosts ICL in LLMs by exploiting a dual form between Transformer attention and gradient descent-based optimization. Concretely, we divide the ICL process into a "Deep-Thinking" stage and an inference stage. The "Deep-Thinking" stage performs iterative forward optimization over the demonstrations, which is expected to boost the reasoning abilities of LLMs at test time by "thinking" about the demonstrations multiple times. It produces accumulated meta-gradients by manipulating the Key-Value matrices in the self-attention modules of the Transformer. The inference stage then takes only the test query as input, without concatenating demonstrations, and applies the learned meta-gradients through attention for output prediction. Demonstrations are therefore not required during inference, since they have already been learned and stored in the final meta-gradients, and LLMs can be adapted to downstream tasks effectively and efficiently. Extensive experiments on ten classification and multiple-choice datasets show that our method achieves substantially better performance than standard ICL in terms of both accuracy and efficiency.
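The dual form mentioned in the abstract can be illustrated with a small numerical sketch. It is not the paper's method. It assumes un-normalized linear attention (the softmax is dropped) and a single pass over the demonstrations, whereas the abstract describes an iterative "Deep-Thinking" stage whose update rule is not given here. All names (`demo_update`, `linear_attention`, `W0`) are hypothetical. Under these assumptions, attending over the demonstration keys and values gives the same output as applying a stored weight update to the query alone, which is why demonstrations need not be concatenated at inference.

```python
import numpy as np


def demo_update(keys, values):
    """Sum of outer products v_i k_i^T over all demonstrations.

    Under the linear-attention assumption this plays the role of an
    implicit weight update ("meta-gradient") of shape (d_v, d_k).
    """
    return values.T @ keys


def linear_attention(query, keys, values):
    """Un-normalized linear attention: sum_i v_i * (k_i . q)."""
    return values.T @ (keys @ query)


rng = np.random.default_rng(0)
n_demo, d_k, d_v = 5, 4, 3

K = rng.normal(size=(n_demo, d_k))   # demonstration keys (toy data)
V = rng.normal(size=(n_demo, d_v))   # demonstration values (toy data)
W0 = rng.normal(size=(d_v, d_k))     # stand-in for the zero-shot weights
q = rng.normal(size=d_k)             # test query

# Standard ICL view: the query attends over the demonstrations.
with_demos = W0 @ q + linear_attention(q, K, V)

# Two-stage view: compute the update once, then apply it to the query alone.
delta_W = demo_update(K, V)
without_demos = (W0 + delta_W) @ q

# The two views agree up to floating-point error.
assert np.allclose(with_demos, without_demos)
```

The equality follows from associativity of matrix multiplication; with softmax attention the correspondence is only approximate.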