Recent work on Transformer-based large language models (LLMs) has revealed striking limits in their working memory capacity, similar to those found in human behavioral studies. Specifically, these models' performance on N-back tasks drops significantly as N increases. However, a mechanistic account of why this phenomenon arises is still lacking. Inspired by the executive attention theory from behavioral science, we hypothesize that the self-attention mechanism within Transformer-based models may be responsible for their working memory capacity limits. To test this hypothesis, we train vanilla decoder-only Transformers to perform N-back tasks and find that attention scores gradually aggregate at the N-back positions over the course of training, suggesting that the model masters the task by learning a strategy that attends to the relationship between the current position and the N-back position. Critically, we find that the total entropy of the attention score matrix increases with N, suggesting that the dispersion of attention scores may underlie the capacity limit observed in N-back tasks.
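The entropy measure referenced above can be made concrete with a minimal sketch. Assuming the attention score matrix is row-wise softmax-normalized (each query position holds a probability distribution over key positions), the total entropy is the sum of per-row Shannon entropies; the function name `attention_entropy` and the toy matrices below are illustrative, not from the paper:

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """Total entropy of an attention score matrix.

    attn: (seq_len, seq_len) array whose rows are softmax-normalized
    attention distributions. Returns the sum of the Shannon entropies
    of all rows; higher values mean more dispersed attention.
    """
    p = np.clip(attn, eps, 1.0)  # guard log(0) for zero-weight entries
    return float(-(attn * np.log(p)).sum())

# Illustration: attention concentrated on a single (e.g. N-back) position
# per query has zero entropy; uniformly spread attention is maximal.
peaked = np.eye(4)               # each query attends to exactly one key
uniform = np.full((4, 4), 0.25)  # attention spread evenly over 4 keys
print(attention_entropy(peaked) < attention_entropy(uniform))  # True
```

Under this measure, the paper's observation corresponds to trained models producing progressively more dispersed (higher-entropy) attention rows as N grows.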