Large language models (LLMs) demonstrate an impressive ability to use information within the context of their input sequences to respond appropriately to data unseen during training. This ability is known as in-context learning (ICL). Humans and non-human animals demonstrate similar abilities; however, their neural architectures differ substantially from those of LLMs. Despite this, a critical component within LLMs, the attention mechanism, resembles modern associative memory models, which are widely used in, and influenced by, the computational neuroscience community to model biological memory systems. Using this connection, we introduce an associative memory model capable of performing ICL. We use this as inspiration for a novel residual stream architecture which allows information to flow directly between attention heads. We test this architecture during training within a two-layer Transformer and show that its ICL abilities manifest more quickly than without this modification. We then apply our architecture in small language models with 8 million parameters, focusing on attention head values, with results also indicating improved ICL performance at this larger and more naturalistic scale.
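The resemblance between attention and associative memory mentioned above can be illustrated with a minimal sketch. It is not the model or architecture introduced in this work. It only shows the generic correspondence: a single softmax-attention read, with stored patterns used as both keys and values, behaves like one retrieval step of a dense associative memory. The function names, the patterns, and the inverse-temperature parameter `beta` are illustrative assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_retrieve(query, keys, values, beta=1.0):
    """Single-query softmax attention: a sum of values, weighted by
    the (scaled) dot-product similarity between the query and each key."""
    scores = [beta * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Hypothetical stored patterns (illustrative only, not taken from the paper).
patterns = [
    [1.0, 1.0, -1.0, -1.0],
    [-1.0, 1.0, 1.0, -1.0],
    [1.0, -1.0, 1.0, 1.0],
]

# A corrupted cue: the first pattern with its last entry flipped.
cue = [1.0, 1.0, -1.0, 1.0]

# Using the patterns as both keys and values gives one memory-retrieval step.
retrieved = attention_retrieve(cue, patterns, patterns, beta=4.0)

# Threshold the result to compare it with the stored binary patterns.
recalled = [1.0 if x > 0 else -1.0 for x in retrieved]
```

With a sufficiently large `beta`, the softmax weights concentrate on the stored pattern most similar to the cue, so the read-out approximates pattern completion. A smaller `beta` would instead blend several stored patterns.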