This paper presents Dual Action Policy (DAP), a novel approach to addressing the dynamics mismatch inherent in the sim-to-real gap of reinforcement learning. DAP uses a single policy to predict two sets of actions: one for maximizing task rewards in simulation, and another dedicated to domain adaptation via reward adjustments. This decoupling makes it easier to maximize the overall reward in the source domain during training. Additionally, DAP incorporates uncertainty-based exploration during training to enhance agent robustness. Experimental results demonstrate DAP's effectiveness in bridging the sim-to-real gap: it outperforms baselines on challenging simulated tasks, and incorporating uncertainty estimation yields further improvement.
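The dual-action idea can be illustrated with a minimal sketch: a single network with a shared trunk and two output heads, one producing the task action and the other the adaptation action. All names, dimensions, and the network structure below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class DualActionPolicy:
    """Hypothetical sketch of a single policy with two action heads."""

    def __init__(self, obs_dim, act_dim, hidden=32):
        # Shared trunk weights plus one weight matrix per head.
        self.W_trunk = rng.normal(0.0, 0.1, (obs_dim, hidden))
        self.W_task = rng.normal(0.0, 0.1, (hidden, act_dim))   # task-reward head
        self.W_adapt = rng.normal(0.0, 0.1, (hidden, act_dim))  # adaptation head

    def forward(self, obs):
        h = np.tanh(obs @ self.W_trunk)     # shared features
        a_task = np.tanh(h @ self.W_task)   # action set maximizing task reward
        a_adapt = np.tanh(h @ self.W_adapt) # action set for domain adaptation
        return a_task, a_adapt

policy = DualActionPolicy(obs_dim=4, act_dim=2)
a_task, a_adapt = policy.forward(np.ones(4))
```

Because both heads share one trunk, a single policy produces both action sets in one forward pass, matching the abstract's description of decoupling task-reward maximization from the adaptation signal.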