Linear recurrent networks (LRNNs) offer linear-time sequence modeling, but standard recurrent updates do not directly expose the supervised products needed for in-context gradient descent. We propose a constructive inductive bias for LRNNs that is sufficient to provide them: equip a diagonal recurrent state with a multiplicative readout and a short sliding-window cross-product self-attention update. The resulting architecture, the Gradient-based Recurrent In-context Learner (GRIL), can implement minibatch gradient descent on a task-specific linear predictor during a single forward pass. The same design extends to multi-step updates and cross-entropy classification, with a limited MLP-based extension to non-linear regression. Empirically, trained GRILs recover the behavior and parameters predicted by the construction on synthetic ICL tasks, and the same architectural bias yields useful performance on Long Range Arena and language modeling. These results present windowed cross-product self-attention as a practical, testable inductive bias for LRNNs that learn in context through gradient-descent-like updates.
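To make the central claim concrete, the following is a minimal sketch, not the paper's construction, of why accumulated cross products are enough for one gradient-descent step on a linear predictor. It assumes a scalar target, squared loss, zero initialization, and a hypothetical token layout in which feature tokens `(x_i, 0)` and label tokens `(0, y_i)` alternate, so that a window of two adjacent tokens can form the product `y_i * x_i`. The names `make_stream`, `recurrent_prediction`, and `decay` are illustrative assumptions.

```python
import random

def gd_prediction(xs, ys, query, lr):
    """Reference: one GD step from w0 = 0 on mean((w.x - y)^2) / 2, then predict."""
    n, d = len(xs), len(query)
    # Gradient at w0 = 0 is -(1/n) * sum_i y_i * x_i.
    grad = [-sum(y * x[j] for x, y in zip(xs, ys)) / n for j in range(d)]
    w1 = [-lr * g for g in grad]
    return sum(w * q for w, q in zip(w1, query))

def make_stream(xs, ys):
    """Assumed layout: feature token (x_i, 0) followed by label token (0, y_i)."""
    d = len(xs[0])
    tokens = []
    for x, y in zip(xs, ys):
        tokens.append(list(x) + [0.0])
        tokens.append([0.0] * d + [y])
    return tokens

def recurrent_prediction(tokens, query, lr, n, decay=1.0):
    """Elementwise linear recurrence fed by a window-of-two cross product."""
    d = len(query)
    state = [0.0] * d            # one scalar per feature (diagonal state)
    prev = [0.0] * (d + 1)       # padding token before the sequence starts
    for cur in tokens:
        # Label slot of the current token times feature slots of the previous
        # token. This is zero at feature tokens, because their label slot is 0.
        state = [decay * s + cur[d] * prev[j] for j, s in enumerate(state)]
        prev = cur
    # Multiplicative readout: the state is multiplied with the query features.
    return (lr / n) * sum(s * q for s, q in zip(state, query))

random.seed(0)
d, n, lr = 3, 8, 0.1
w_true = [random.gauss(0, 1) for _ in range(d)]
xs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
ys = [sum(w * xj for w, xj in zip(w_true, x)) for x in xs]
query = [random.gauss(0, 1) for _ in range(d)]

pred_gd = gd_prediction(xs, ys, query, lr)
pred_rec = recurrent_prediction(make_stream(xs, ys), query, lr, n)
print(abs(pred_gd - pred_rec))  # the two predictions agree up to rounding
```

With `decay=1.0` the state is a plain running sum, so the recurrent prediction matches the explicit gradient step. Multi-step updates and classification, which the abstract also mentions, would need more than this single-step sketch covers.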