Large language models have demonstrated an impressive ability to perform factual recall. Prior work has found that transformers trained on factual recall tasks can store information at a rate proportional to their parameter count. In our work, we show that shallow transformers can use a combination of associative memories to obtain such near-optimal storage capacity. We begin by proving that the storage capacities of both linear and MLP associative memories scale linearly with parameter count. We next introduce a synthetic factual recall task and prove that a transformer with a single layer of self-attention followed by an MLP can obtain 100% accuracy on the task whenever either the total number of self-attention parameters or of MLP parameters scales (up to log factors) linearly with the number of facts. In particular, the transformer can trade off between using the value matrices and the MLP as an associative memory to store the dataset of facts. We complement these expressivity results with an analysis of the gradient flow trajectory of a simplified linear attention model trained on our factual recall task, where we show that the model exhibits sequential learning behavior.
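To make the associative-memory mechanism concrete, here is a minimal sketch of a linear associative memory built from random embeddings. The dimensions, the outer-product storage rule, and the nearest-embedding decoder are illustrative assumptions, not the constructions analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 256, 1000   # embedding dimension and number of stored facts (illustrative)

# Random near-orthogonal embeddings for keys (inputs) and values (outputs).
K = rng.normal(size=(N, d)) / np.sqrt(d)   # each row has norm ~1
V = rng.normal(size=(N, d)) / np.sqrt(d)

# Linear associative memory: a d x d matrix (d^2 parameters) formed as a
# sum of value-key outer products, W = sum_i v_i k_i^T.
W = V.T @ K

# Retrieval: apply W to a key, then decode to the nearest value embedding.
# W @ k_i ~= v_i, since k_i . k_i ~= 1 while cross terms k_j . k_i are O(1/sqrt(d)).
scores = V @ (W @ K.T)               # (N, N): score of each value embedding per query
pred = np.argmax(scores, axis=0)
print("recall accuracy:", np.mean(pred == np.arange(N)))
```

Here d^2 = 65,536 parameters store N = 1,000 associations with essentially exact retrieval; the paper's capacity results sharpen this crude picture by showing the number of storable associations can grow linearly in the parameter count (up to log factors).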
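For intuition about the synthetic factual recall task, the following is one plausible instantiation; every specific here (vocabulary sizes, disjoint token-id ranges, and the random placement of the subject and relation tokens among noise tokens) is an assumption for illustration rather than the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(1)
S, R, T, L = 500, 20, 64, 8   # subjects, relations, noise tokens, sequence length (assumed)
A = 100                        # number of possible answer tokens (assumed)

# Ground-truth fact table: answer[s, r] is the attribute token for (subject, relation).
answer = rng.integers(A, size=(S, R))

def sample_example():
    """Draw one training example: a subject token and a relation token placed
    at random positions among noise tokens; the label is answer[s, r]."""
    s, r = rng.integers(S), rng.integers(R)
    seq = rng.integers(T, size=L).tolist()          # fill the sequence with noise tokens
    i, j = rng.choice(L, size=2, replace=False)
    seq[i], seq[j] = T + s, T + S + r               # disjoint id ranges per token type
    return seq, answer[s, r]

seq, y = sample_example()
print(seq, "->", y)
```

A model attains 100% accuracy on such a task exactly when it maps every (subject, relation) pair to its answer token, which is the lookup behavior the associative-memory constructions provide.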