Large language models (LLMs) are powerful models that can learn concepts at inference time via in-context learning (ICL). While theoretical studies, e.g., \cite{zhang2023trained}, attempt to explain the mechanism of ICL, they assume that the input $x_i$ and the output $y_i$ of each demonstration example share a single token (i.e., structured data). In practice, however, the examples are usually text, and all words, regardless of their logical relationships, are stored in separate tokens (i.e., unstructured data \cite{wibisono2023role}). To understand how LLMs learn from unstructured data in ICL, this paper studies the role of each component in the transformer architecture and provides a theoretical account of the architecture's success. In particular, we consider a simple transformer with one or two attention layers and linear regression tasks for ICL prediction. We observe that (1) a transformer with two layers of (self-)attention and a look-ahead attention mask can learn from the prompt in unstructured data, and (2) positional encoding can match the $x_i$ and $y_i$ tokens to achieve better ICL performance.
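To make the structured/unstructured distinction concrete, the following is a minimal sketch (our own illustration, not code from the paper; the zero-padding scheme and interleaved token order are assumptions) of how a linear-regression ICL prompt can be embedded either with $(x_i, y_i)$ fused into one token or split across separate tokens, together with a look-ahead (causal) attention mask:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 3                       # feature dimension, number of demonstrations
w = rng.normal(size=d)            # hidden regression weights
X = rng.normal(size=(n, d))       # demonstration inputs x_1, ..., x_n
y = X @ w                         # noiseless labels y_i = w^T x_i

# Structured prompt: each column is one token holding (x_i, y_i) jointly.
structured = np.vstack([X.T, y[None, :]])              # shape (d+1, n)

# Unstructured prompt: x_i and y_i occupy separate, interleaved tokens,
# each zero-padded to a common embedding dimension d+1 (an assumed scheme).
tokens = []
for i in range(n):
    tokens.append(np.concatenate([X[i], [0.0]]))        # x-token
    tokens.append(np.concatenate([np.zeros(d), [y[i]]]))  # y-token
unstructured = np.stack(tokens, axis=1)                # shape (d+1, 2n)

# Look-ahead (causal) mask: token t attends only to tokens s <= t.
T = unstructured.shape[1]
mask = np.tril(np.ones((T, T), dtype=bool))            # shape (2n, 2n)
```

Under the structured format, a single attention layer can read $x_i$ and $y_i$ from the same column; under the unstructured format, the model must first associate each $y_i$ token with its preceding $x_i$ token, which is where a second attention layer and positional encoding come into play.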