Autoregressive sequence modeling is the cornerstone of modern generative AI, powering results across modalities ranging from text to image generation. However, a fundamental limitation of this paradigm is the rigid structural coupling of model capacity to computational cost: expanding a model's parametric memory -- its repository of factual knowledge or visual patterns -- traditionally requires deepening or widening the network, which incurs a proportional rise in active FLOPs. In this work, we introduce $\textbf{MoVE (Mixture of Value Embeddings)}$, a mechanism that breaks this coupling and establishes a new axis for scaling capacity. MoVE decouples memory from compute by introducing a global bank of learnable value embeddings shared across all attention layers. At every step in the sequence, the model uses a differentiable soft gating mechanism to dynamically mix concepts retrieved from this bank into the standard value projection. This architecture allows parametric memory to be scaled independently of network depth simply by increasing the number of embedding slots. We validate MoVE through strictly controlled experiments on two representative applications of autoregressive modeling: text generation and image generation. In both domains, MoVE yields consistent performance improvements over standard and layer-wise memory baselines, enabling "memory-dense" models that achieve lower perplexity and higher fidelity than their dense counterparts at comparable compute budgets.
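The retrieval-and-mix step described above can be sketched in a few lines. This is a minimal, hypothetical illustration only: the function and variable names (`move_value`, `bank`, `gate`) and the additive combination of retrieved concepts with the standard value projection are our assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of MoVE's shared value-embedding bank.
# Names and the exact mixing rule are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def move_value(h, w_v, bank):
    """Augment the standard value projection with concepts retrieved
    from a global bank shared across all attention layers.

    h:    (seq_len, d_model) hidden states
    w_v:  (d_model, d_v)     standard value projection weights
    bank: (n_slots, d_v)     learnable value-embedding bank
    """
    v = h @ w_v                 # standard value projection
    gate = softmax(v @ bank.T)  # soft, differentiable addressing over slots
    retrieved = gate @ bank     # convex mixture of bank embeddings
    return v + retrieved        # mixed into the value path

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
w_v = rng.normal(size=(8, 8))
bank = rng.normal(size=(16, 8))  # capacity grows with n_slots, not depth
v_mixed = move_value(h, w_v, bank)
```

Because the bank is addressed by a softmax, the gate is fully differentiable and the retrieved vector is a convex combination of slots, so enlarging `bank` adds parametric memory without changing the per-token FLOPs of the attention layers themselves beyond the single bank lookup.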