Unified Policy Value Decomposition for Rapid Adaptation

Rapid adaptation in complex control systems remains a central challenge in reinforcement learning. We introduce a framework in which the policy and value functions share a low-dimensional coefficient vector, a goal embedding, that captures task identity and enables immediate adaptation to novel tasks without retraining representations. During pretraining, we jointly learn structured value bases and compatible policy bases through a bilinear actor-critic decomposition. The critic factorizes as Q(s,a,g) = sum_k G_k(g) y_k(s,a), where G_k(g) is the k-th component of a goal-conditioned coefficient vector and y_k(s,a) are learned value basis functions. This multiplicative gating, in which a context signal scales a set of state-dependent bases, is reminiscent of gain modulation observed in Layer 5 pyramidal neurons, where top-down inputs modulate the gain of sensory-driven responses without altering their tuning. Building on Successor Features, we extend the decomposition to the actor, which composes a set of primitive policies weighted by the same coefficients G_k(g). At test time the bases are frozen and G_k(g) is estimated zero-shot via a single forward pass, so the agent adapts to a novel task without any gradient update. We train a Soft Actor-Critic agent on the MuJoCo Ant environment under a multi-directional locomotion objective that requires the agent to walk in eight directions specified as continuous goal vectors. The bilinear structure allows each policy head to specialize to a subset of directions, while the shared coefficient layer generalizes across them and accommodates novel directions by interpolating in goal embedding space. Our results suggest that shared low-dimensional goal embeddings offer a general mechanism for rapid, structured adaptation in high-dimensional control, and highlight a potentially biologically plausible principle for efficient transfer in complex reinforcement learning systems.
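To make the bilinear structure concrete, the following is a minimal NumPy sketch of the forward pass only, not the implementation used in this work. The layer sizes, the single-layer tanh bases, the softmax over coefficients, the 2-D unit-vector goal encoding, and the composition of primitive policies as a weighted sum of mean actions are all illustrative assumptions; every function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions, not values from the paper.
K = 8            # number of shared bases / primitive policies
STATE_DIM = 27   # hypothetical observation size
ACTION_DIM = 8   # hypothetical action size
GOAL_DIM = 2     # goal given as a 2-D unit direction vector (assumption)


def init_linear(in_dim, out_dim):
    """Random weights and zero bias for one linear layer."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


W_g, b_g = init_linear(GOAL_DIM, K)                  # coefficient layer G(g)
W_y, b_y = init_linear(STATE_DIM + ACTION_DIM, K)    # value bases y_k(s, a)
W_pi, b_pi = init_linear(STATE_DIM, K * ACTION_DIM)  # primitive policy heads


def coefficients(goal):
    """G(g): one forward pass from goal to K coefficients.

    The softmax normalization is an assumption made for this sketch.
    """
    logits = goal @ W_g + b_g
    e = np.exp(logits - logits.max())
    return e / e.sum()


def value_bases(state, action):
    """y(s, a): K goal-independent value basis outputs."""
    return np.tanh(np.concatenate([state, action]) @ W_y + b_y)


def q_value(state, action, goal):
    """Q(s, a, g) = sum_k G_k(g) * y_k(s, a)."""
    return float(coefficients(goal) @ value_bases(state, action))


def policy_mean(state, goal):
    """Mean action: primitive heads mixed by the same coefficients G(g)."""
    heads = np.tanh(state @ W_pi + b_pi).reshape(K, ACTION_DIM)
    return coefficients(goal) @ heads


def direction_goal(angle):
    """Encode a walking direction (radians) as a unit goal vector."""
    return np.array([np.cos(angle), np.sin(angle)])


# A direction between two of eight evenly spaced headings: only the goal
# input changes, while every weight matrix above stays fixed.
state = rng.normal(size=STATE_DIM)
novel_goal = direction_goal(np.pi / 8)
action = policy_mean(state, novel_goal)
print(action.shape, q_value(state, action, novel_goal))
```

In this sketch, switching to a new direction changes only the output of `coefficients`, which is the sense in which adaptation needs a forward pass rather than a gradient update.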
