Representation and similarity learning have improved the sample efficiency of Reinforcement Learning (RL), but they are rarely used to shape policy updates directly in the action space. To bridge this gap, State-Action Value Geometry Optimization (SAVGO) is proposed: a geometry-aware RL algorithm that explicitly incorporates value-based similarity into the policy update. Specifically, SAVGO learns a joint state-action embedding space in which pairs with similar action-value estimates exhibit high cosine similarity, while dissimilar pairs are mapped to distinct directions. This learned geometry induces a similarity kernel over the candidate actions sampled at each update, so that policy improvement is guided directly toward higher-value regions, going beyond purely local gradient-based updates. As a result, representation learning, value estimation, and policy optimization are unified within a single geometry-consistent objective, while the scalability of off-policy actor-critic training is preserved. The method is evaluated on standard MuJoCo continuous-control benchmarks, where it improves over strong baselines on challenging high-dimensional tasks. Ablation studies analyze the contributions of value-geometry learning and similarity-based policy updates.
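To make the two ingredients described above concrete, the following is a minimal NumPy sketch, not the SAVGO implementation. The abstract does not specify the loss or the kernel, so everything here is an assumption chosen for illustration: the exponential value-similarity target, the squared-error geometry loss, the softmax kernel centered on the highest-value candidate, and all function names and parameters (`scale`, `temperature`) are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(z):
    """Pairwise cosine similarity between the rows of z (shape [n, d])."""
    norms = np.maximum(np.linalg.norm(z, axis=1, keepdims=True), 1e-8)
    z_unit = z / norms
    return z_unit @ z_unit.T


def value_similarity_target(q, scale=1.0):
    """Assumed target in (-1, 1]: 1 for equal Q-values, approaching -1 as they diverge."""
    diff = np.abs(q[:, None] - q[None, :])
    return 2.0 * np.exp(-diff / scale) - 1.0


def geometry_loss(z, q, scale=1.0):
    """Assumed loss: mean squared gap between embedding similarity and the value target."""
    gap = cosine_similarity_matrix(z) - value_similarity_target(q, scale)
    return float(np.mean(gap ** 2))


def kernel_weighted_action(z_candidates, q_candidates, actions, temperature=0.1):
    """Assumed kernel: softmax over cosine similarity to the best candidate's embedding.

    Returns the kernel-weighted average action and the weights themselves.
    """
    sim = cosine_similarity_matrix(z_candidates)
    best = int(np.argmax(q_candidates))
    logits = sim[best] / temperature
    weights = np.exp(logits - logits.max())  # subtract the max for numerical stability
    weights /= weights.sum()
    return weights @ actions, weights
```

In this sketch, `geometry_loss` would be minimized with respect to the embedding network that produces `z`, and the output of `kernel_weighted_action` could serve as a regression target for the actor. How SAVGO actually combines these terms with the critic and actor losses is not stated in the abstract.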