Self-attention is often viewed as probabilistic query-key lookup, motivating designs that preserve normalized attention scores and fixed positional semantics. We advocate a simpler and more unified perspective: an autoregressive attention head can be viewed as a dynamic two-layer MLP whose weights are instantiated from the context history. From this view, attention scores form an ever-growing hidden representation, and standard MLP activations such as ReLU or GLU naturally implement input-conditioned selection over a context-dependent memory pool rather than a probability distribution. Based on this formulation, we introduce HyperMLP and HyperGLU, which learn dynamic mixing in both feature space and sequence space, using a reverse-offset (lag) layout to align temporal mixing with autoregressive semantics. We provide theoretical characterizations of the expressivity and implications of this structure, and empirically show that HyperMLP/HyperGLU consistently outperform strong softmax-attention baselines under matched parameter budgets.
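The "dynamic two-layer MLP" reading described above can be sketched in a few lines. This is an illustrative toy, not the paper's HyperMLP/HyperGLU architecture: the function name and shapes are hypothetical, and it simply shows one autoregressive attention step where the context keys act as the first layer's weights, the context values as the second layer's weights, and a standard MLP activation (here ReLU) replaces softmax normalization.

```python
import numpy as np

def attention_as_dynamic_mlp(q, K, V, activation=lambda h: np.maximum(h, 0.0)):
    """One autoregressive attention step viewed as a two-layer MLP whose
    weights are instantiated from the context history.

    q: (d,) current query; K, V: (t, d) past keys/values at step t.
    The hidden vector has one unit per past token, so it grows with t.
    """
    h = K @ q          # layer 1: attention scores as the hidden representation, shape (t,)
    h = activation(h)  # e.g. ReLU: input-conditioned selection, not a probability distribution
    return V.T @ h     # layer 2: mix the context-dependent memory pool, shape (d,)

rng = np.random.default_rng(0)
d, t = 8, 5
q = rng.standard_normal(d)
K = rng.standard_normal((t, d))
V = rng.standard_normal((t, d))
out = attention_as_dynamic_mlp(q, K, V)
print(out.shape)  # (8,)
```

Unlike softmax attention, the activated scores here need not sum to one, which is what lets the activation act as a selector over the memory pool rather than a normalized distribution.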