The ability of machine learning models to store input information in hidden-layer vector embeddings, analogous to the concept of `memory', is widely employed but not well characterized. We find that language model embeddings typically contain relatively little input information, regardless of the data and compute scale used during training. In contrast, embeddings from autoencoders trained for input regeneration achieve nearly perfect memory formation. Substituting memory embeddings for token sequences yields substantial computational savings, motivating the introduction of a parallelizable encoder-decoder memory model architecture. When trained causally, these models form information-poor embeddings that do not support arbitrary information access, but by combining causal and information-retention objective functions they learn to form and decode information-rich memories. Training can be further streamlined by freezing a high-fidelity encoder and applying a curriculum in which decoders first learn to process memories and then additionally learn to predict next tokens. We introduce the perspective that next-token prediction training alone is poorly suited to accurate memory formation because the objective itself is non-invertible, motivating combined objective functions for models in which the entire input is not exposed.
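The combined objective described above can be sketched as a weighted sum of a causal next-token cross-entropy and an input-reconstruction ("information retention") cross-entropy. This is a minimal NumPy illustration, not the paper's implementation: the weighting term `lambda_mem` and the random stand-in logits are assumptions introduced here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, seq_len = 10, 5
tokens = rng.integers(0, vocab, size=seq_len)  # toy input token sequence

def cross_entropy(logits, targets):
    # mean negative log-likelihood of targets under softmax(logits),
    # computed with a stable log-sum-exp
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Stand-ins for model outputs (random here; in practice these come from
# the decoder): next-token logits, and reconstruction logits decoded
# from the memory embedding.
next_logits = rng.normal(size=(seq_len - 1, vocab))
recon_logits = rng.normal(size=(seq_len, vocab))

causal_loss = cross_entropy(next_logits, tokens[1:])   # predict token t+1
retention_loss = cross_entropy(recon_logits, tokens)   # regenerate the input
lambda_mem = 1.0                                       # weighting (assumed)
combined_loss = causal_loss + lambda_mem * retention_loss
```

Minimizing only `causal_loss` leaves the embedding free to discard input detail, whereas the `retention_loss` term directly penalizes information loss, which is the intuition behind combining the two objectives.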