We use the fast physics simulator MuJoCo to run tasks in continuous control environments and describe the observation space, action space, and rewards for each task. We benchmark value-based methods for continuous control by comparing Q-learning and SARSA through a discretization approach and, using them as baselines, progressively move to a state-of-the-art deep policy gradient method, DDPG. Over a large number of episodes, Q-learning outscored SARSA, but DDPG outperformed both within a small number of episodes. Lastly, we fine-tuned the model hyperparameters, aiming to obtain better performance with less time and fewer resources. We anticipated that the new design for DDPG would vastly improve performance, and after only a few episodes we were already able to achieve decent average rewards. Given adequate time and computational resources, we expect performance to improve further.
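As a rough illustration of the discretization approach mentioned above, the following sketch bins a continuous observation into a tabular state and contrasts the Q-learning and SARSA bootstrap targets. It is not the authors' implementation: the function names, the uniform binning scheme, the bin counts, and the assumption that actions have already been reduced to a finite set are all hypothetical choices made for the example.

```python
import numpy as np

def discretize(value, low, high, n_bins):
    """Map each component of a continuous vector to a bin index in [0, n_bins - 1].

    Uniform binning is an assumption of this sketch, not a detail from the paper.
    """
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    value = np.clip(np.asarray(value, dtype=float), low, high)
    ratio = (value - low) / (high - low)
    bins = np.minimum((ratio * n_bins).astype(int), n_bins - 1)
    return tuple(int(b) for b in bins)

def q_learning_target(q, next_state, reward, gamma):
    # Off-policy: bootstrap from the greedy action in the next state.
    return reward + gamma * np.max(q[next_state])

def sarsa_target(q, next_state, next_action, reward, gamma):
    # On-policy: bootstrap from the action actually taken in the next state.
    return reward + gamma * q[next_state + (next_action,)]

def td_update(q, state, action, target, alpha):
    # Move Q(state, action) a fraction alpha toward the chosen target, in place.
    idx = state + (action,)
    q[idx] += alpha * (target - q[idx])
```

The Q-table here has one axis per observation dimension plus a final axis over the discretized actions, for example `np.zeros((4, 4, 3))` for a two-dimensional observation with 4 bins each and 3 actions. The two methods differ only in the target passed to `td_update`, which is why they can share the same table and binning.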