Controlling high-dimensional systems in biological and robotic applications is challenging due to expansive state-action spaces, where effective exploration is critical. Commonly used exploration strategies in reinforcement learning are largely undirected and degrade sharply as action dimensionality grows. Many existing methods resort to dimensionality reduction, which constrains policy expressiveness and forfeits system flexibility. We introduce Q-guided Flow Exploration (Qflex), a scalable reinforcement learning method that conducts exploration directly in the native high-dimensional action space. During training, Qflex transports actions sampled from a learnable source distribution along a probability flow induced by the learned value function, aligning exploration with task-relevant gradients rather than isotropic noise. Our proposed method substantially outperforms representative online reinforcement learning baselines across diverse high-dimensional continuous-control benchmarks. Qflex also successfully controls a full-body human musculoskeletal model to perform agile, complex movements, demonstrating superior scalability and sample efficiency in very high-dimensional settings. Our results indicate that value-guided flows offer a principled and practical route to exploration at scale.
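The core mechanism, transporting actions from a source distribution along a gradient field induced by the learned value function, can be illustrated with a minimal sketch. This is a hypothetical simplification for intuition only, not the Qflex implementation: the `q_grad` interface, step counts, and the toy quadratic value function are all assumptions, and the true method uses a learned Q-network and a probability flow rather than plain gradient ascent.

```python
import numpy as np

def q_guided_flow(q_grad, a0, n_steps=20, step_size=0.05):
    """Move an action a0 along the gradient field of the value function
    Q(s, a) -- a simplified stand-in for the value-induced probability
    flow described in the abstract (hypothetical interface)."""
    a = np.array(a0, dtype=float)
    for _ in range(n_steps):
        a = a + step_size * q_grad(a)  # follow the task-relevant gradient
    return a

# Toy example: Q(s, a) = -||a - a_star||^2 with a hypothetical optimum.
a_star = np.array([1.0, -1.0, 0.5])
q_grad = lambda a: -2.0 * (a - a_star)  # gradient of Q w.r.t. the action

rng = np.random.default_rng(0)
a0 = rng.normal(size=3)                  # draw from a source distribution
a_explored = q_guided_flow(q_grad, a0)   # action pulled toward high-Q region
```

In contrast to adding isotropic Gaussian noise, each exploration step here is shaped by the value landscape, which is the property the abstract credits for scalability in high-dimensional action spaces.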