Rapid adaptation in complex control systems remains a central challenge in reinforcement learning. We introduce a framework in which the policy and value functions share a low-dimensional coefficient vector, a goal embedding, that captures task identity. During pretraining, we jointly learn structured value bases and compatible policy bases through a bilinear actor-critic decomposition. The critic factorizes as Q(s, a, g) = sum_k G_k(g) y_k(s, a), where G_k(g) is the k-th component of the goal-conditioned coefficient vector G(g) and y_k(s, a) is the k-th learned value basis function. This multiplicative gating, in which a context signal scales a set of state-dependent bases, is reminiscent of the gain modulation observed in Layer 5 pyramidal neurons, where top-down inputs modulate the gain of sensory-driven responses without altering their tuning. Building on Successor Features, we extend the decomposition to the actor, which composes a set of primitive policies weighted by the same coefficients G_k(g). At test time the bases are frozen and G(g) is estimated zero-shot in a single forward pass, enabling immediate adaptation to novel tasks without any gradient update or retraining of the representations. We train a Soft Actor-Critic agent on the MuJoCo Ant environment under a multi-directional locomotion objective that requires the agent to walk in eight directions, each specified as a continuous goal vector. The bilinear structure allows each policy head to specialize to a subset of directions, while the shared coefficient layer generalizes across them and accommodates novel directions by interpolating in goal-embedding space. Our results suggest that shared low-dimensional goal embeddings offer a general mechanism for rapid, structured adaptation in high-dimensional control, and highlight a potentially biologically plausible principle for efficient transfer in complex reinforcement learning systems.
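The bilinear decomposition described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The dimensions, the random linear maps standing in for trained networks, and the function names (`coefficients`, `value_bases`, `q_value`, `policy_mean`) are all hypothetical. In particular, the abstract does not say how the primitive policies are composed, so the sketch assumes a coefficient-weighted sum of primitive action means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
K = 4            # number of value / policy bases
STATE_DIM = 6
ACTION_DIM = 3
GOAL_DIM = 2     # e.g. a 2-D direction vector

# Random linear maps standing in for trained networks (assumption).
W_g = rng.normal(scale=0.1, size=(GOAL_DIM, K))                   # coefficient layer G(g)
W_y = rng.normal(scale=0.1, size=(STATE_DIM + ACTION_DIM, K))     # value bases y_k(s, a)
W_pi = rng.normal(scale=0.1, size=(K, STATE_DIM, ACTION_DIM))     # K primitive policy heads


def coefficients(goal):
    """Goal embedding G(g): one forward pass, no gradient step."""
    return goal @ W_g                                  # shape (K,)


def value_bases(state, action):
    """Value basis functions y_k(s, a), independent of the goal."""
    return np.tanh(np.concatenate([state, action]) @ W_y)   # shape (K,)


def q_value(state, action, goal):
    """Bilinear critic: Q(s, a, g) = sum_k G_k(g) * y_k(s, a)."""
    return float(coefficients(goal) @ value_bases(state, action))


def policy_mean(state, goal):
    """Actor: primitive action means weighted by the same G_k(g).

    The weighted-sum composition is an assumption of this sketch.
    """
    primitives = np.tanh(np.einsum("s,ksa->ka", state, W_pi))   # (K, ACTION_DIM)
    return coefficients(goal) @ primitives                      # (ACTION_DIM,)


# Usage: with the bases frozen, a novel direction only changes G(g).
state = rng.normal(size=STATE_DIM)
action = rng.normal(size=ACTION_DIM)
novel_goal = np.array([np.cos(0.3), np.sin(0.3)])   # a direction not among eight training headings
print(q_value(state, action, novel_goal))
print(policy_mean(state, novel_goal))
```

Because the goal enters only through `coefficients`, adapting to a new goal in this sketch means recomputing a K-dimensional vector while `W_y` and `W_pi` stay fixed, which mirrors the frozen-bases, single-forward-pass setting described in the abstract.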