Deep reinforcement learning (RL) is increasingly deployed in resource-constrained environments, yet the go-to function approximators, multilayer perceptrons (MLPs), are often parameter-inefficient because their inductive bias is poorly matched to the smooth structure of many value functions. This mismatch can also hinder sample efficiency and slow policy learning in the capacity-limited regime. Although model compression techniques exist, they operate post hoc and do not improve learning efficiency. Recent spline-based separable architectures, such as Kolmogorov-Arnold Networks (KANs), have been shown to offer parameter efficiency but are widely reported to incur significant computational overhead, especially at scale. To address these limitations, this work introduces SPAN (SPline-based Adaptive Networks), a novel function-approximation approach for RL. SPAN adapts the low-rank KHRONOS framework by integrating a learnable preprocessing layer with a separable tensor-product B-spline basis. SPAN is evaluated on discrete (PPO) and high-dimensional continuous (SAC) control tasks, as well as offline settings (Minari/D4RL). Empirical results show that SPAN achieves a 30-50% improvement in sample efficiency and 1.3-9 times higher success rates across benchmarks compared to MLP baselines. Furthermore, SPAN demonstrates superior anytime performance and robustness to hyperparameter variations, suggesting it as a viable, high-performance alternative for learning intrinsically efficient policies in resource-limited settings.
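To make the core idea concrete, the following is a minimal NumPy sketch of a rank-separable tensor-product spline approximator of the kind the abstract describes: each input dimension is expanded in a 1-D spline basis, the per-dimension responses are multiplied across dimensions, and the products are summed over a small number of rank terms. This is an illustrative assumption, not SPAN itself: the function names (`hat_basis`, `separable_spline_forward`) are hypothetical, degree-1 "hat" B-splines on a uniform grid stand in for the actual spline order, and the learnable preprocessing layer is omitted.

```python
import numpy as np

def hat_basis(x, grid):
    """Degree-1 B-spline (hat) basis on a uniform knot grid.

    x: (N,) scalar inputs; grid: (M,) uniform knots.
    Returns (N, M) basis values; rows sum to 1 inside the domain.
    """
    h = grid[1] - grid[0]
    dist = np.abs(x[:, None] - grid[None, :]) / h
    return np.clip(1.0 - dist, 0.0, None)

def separable_spline_forward(X, grid, coeffs):
    """Rank-R separable tensor-product spline model.

    X:      (N, D) inputs, assumed to lie in [grid[0], grid[-1]].
    coeffs: (R, D, M) learnable coefficients, one 1-D spline per
            (rank, dimension) pair.
    Returns (N,) outputs: product over dimensions, sum over ranks,
    i.e. f(x) = sum_r prod_d ( sum_i c[r,d,i] * phi_i(x_d) ).
    """
    N, D = X.shape
    R = coeffs.shape[0]
    out = np.ones((N, R))
    for d in range(D):
        B = hat_basis(X[:, d], grid)      # (N, M) per-dimension basis
        out *= B @ coeffs[:, d, :].T      # (N, R) per-rank 1-D splines
    return out.sum(axis=1)                # combine rank terms
```

The point of the factorization is that the parameter count grows as R * D * M rather than M**D for a full tensor-product grid, which is the parameter-efficiency argument the abstract makes. With degree-1 hats and coefficients set to the knot locations, the model reproduces linear interpolation of the identity, which gives a quick sanity check.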