Modern large language models (LLMs) excel at tasks that require storing and retrieving knowledge, such as factual recall and question answering. Transformers are central to this capability because they can encode information during training and retrieve it at inference. Existing theoretical analyses typically study transformers under idealized assumptions such as infinite data or orthogonal embeddings. In realistic settings, however, models are trained on finite datasets with non-orthogonal (random) embeddings. We address this gap by analyzing a single-layer transformer with random embeddings, trained by gradient descent on the empirical loss of a simple token-retrieval task in which the model must identify an informative token within a length-$L$ sequence and learn a one-to-one mapping from tokens to labels. Our analysis tracks the ``early phase'' of gradient descent and yields explicit formulas for the model's storage capacity, revealing a multiplicative dependence among the sample size $N$, the embedding dimension $d$, and the sequence length $L$. We validate these scalings numerically and complement them with a lower bound for the underlying statistical problem, demonstrating that this multiplicative scaling is intrinsic under non-orthogonal embeddings.
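To make the setup concrete, below is a minimal sketch of one way to instantiate the token-retrieval task: each length-$L$ sequence contains a single informative token whose label is given by a fixed one-to-one map, the remaining positions hold uninformative filler tokens, and embeddings are random (hence non-orthogonal). The split into signal and noise sub-vocabularies and all sizes are illustrative assumptions, not the paper's exact construction.

```python
# Hypothetical data-generating process for the token-retrieval task (assumed
# construction for illustration). One informative (signal) token per sequence,
# placed at a random position; its label is given by a fixed one-to-one map.
import numpy as np

rng = np.random.default_rng(0)

V_sig, V_noise, d, L, N = 32, 32, 64, 8, 2000  # signal/noise vocab sizes, embedding dim, length, samples
V = V_sig + V_noise

# Random token embeddings (one row per token); random vectors in R^d are only
# approximately orthogonal, which is the non-orthogonal regime studied here.
E = rng.normal(0.0, 1.0 / np.sqrt(d), size=(V, d))

# One-to-one map from signal tokens {0, ..., V_sig-1} to labels.
label_of = rng.permutation(V_sig)

def sample_batch(n):
    """Draw n sequences, each with one signal token at a uniformly random position."""
    signal = rng.integers(0, V_sig, size=n)          # informative tokens
    tokens = rng.integers(V_sig, V, size=(n, L))     # uninformative filler tokens
    pos = rng.integers(0, L, size=n)                 # position of the signal token
    tokens[np.arange(n), pos] = signal
    X = E[tokens]                                    # embedded sequences, shape (n, L, d)
    y = label_of[signal]                             # labels, shape (n,)
    return X, y

X, y = sample_batch(N)
print(X.shape, y.shape)  # (2000, 8, 64) (2000,)
```

Under this kind of construction, a single-layer transformer would have to attend to the informative position and read out its label from the (finite) sample of $N$ such sequences; the scalings discussed above concern how large $N$, $d$, and $L$ must be for this to succeed.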