Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning

Transformer-based large language models (LLMs) have displayed remarkable creative prowess and emergent capabilities. Existing empirical studies have revealed a strong connection between these emergent abilities and in-context learning (ICL), the capacity to solve new tasks using only task-specific prompts without further fine-tuning. Meanwhile, both empirical and theoretical studies show that the multi-concept semantic representations encoded by transformer-based LLMs exhibit a linear regularity. However, existing theoretical work fails to explain the connection between this regularity and the innovative power of ICL. Additionally, prior work often focuses on simplified, unrealistic scenarios involving linear transformers or unrealistic loss functions, and achieves only linear or sub-linear convergence rates. In contrast, this work provides a fine-grained mathematical analysis showing how transformers leverage the multi-concept semantics of words to enable powerful ICL and strong out-of-distribution ICL abilities, offering insights into how transformers innovate solutions for certain unseen tasks encoded with multiple cross-concept semantics. Inspired by empirical studies on the linear latent geometry of LLMs, the analysis is based on a concept-based low-noise sparse coding prompt model. Leveraging advanced techniques, this work establishes exponential convergence of the 0-1 loss over highly non-convex training dynamics, with an analysis that pioneers the joint treatment of softmax self-attention, ReLU-activated MLPs, and cross-entropy loss. Empirical simulations corroborate the theoretical findings.
