Recurrent networks do not need Jacobian propagation to adapt online. The hidden state already carries temporal credit through the forward pass; immediate derivatives suffice if you stop corrupting them with stale trace memory and normalize gradient scales across parameter groups. An architectural rule predicts when normalization is needed: \b{eta}2 is required when gradients must pass through a nonlinear state update with no output bypass, and unnecessary otherwise. Across ten architectures, real primate neural data, and streaming ML benchmarks, immediate derivatives with RMSprop match or exceed full RTRL, scaling to n = 1024 at 1000x less memory.
翻译:循环网络进行在线自适应时无需通过雅可比传播。隐藏状态已通过前向传播携带时间信用;若停止使用过时的迹记忆破坏即时导数,并在参数组间归一化梯度尺度,仅凭即时导数便已足够。一项架构规则可预测何时需要归一化:当梯度必须经过非线性状态更新且无输出旁路时,需使用β²;否则无需归一化。在十种架构、真实的灵长类神经数据及流式机器学习基准测试中,采用RMSprop的即时导数表现与完整RTRL相当甚至更优,可扩展至n=1024且内存消耗降低1000倍。