Soft actor-critic (SAC), a reinforcement learning algorithm, is expected to become one of the next-generation robot control schemes. Because it maximizes policy entropy, a SAC-trained controller should be robust to noise and perturbation, which is valuable in real-world robot applications. However, in the current implementation the priority given to maximizing the policy entropy is tuned automatically, and the tuning rule can be interpreted as enforcing an equality constraint that binds the policy entropy to its specified lower bound. The current SAC therefore no longer maximizes the policy entropy, contrary to expectation. To resolve this issue, this paper improves the implementation with a learnable state-dependent slack variable, which handles the inequality constraint on the policy entropy appropriately by reformulating it as the corresponding equality constraint. The slack variable is optimized with a switching-type loss function that accounts for the dual objectives of satisfying the equality constraint and checking the lower bound. In the MuJoCo and PyBullet simulators, the modified SAC achieved statistically higher robustness against adversarial attacks than the original while regularizing the norm of the action. A variable impedance task on a real robot demonstrated the applicability of the modified SAC to real-world robot control. In particular, the modified SAC maintained adaptive behaviors during physical human-robot interaction, which it had not experienced at all during training. https://youtu.be/EH3xVtlVaJw
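The equality-constraint reading of the temperature tuning rule can be illustrated with a minimal sketch. It assumes the commonly used temperature loss J(alpha) = alpha * (H - H_target), where H is the policy entropy and H_target its lower bound, and takes one gradient step on log(alpha). The function names, the learning rate, and the use of a plain scalar `slack` are assumptions made for illustration only. The paper's slack variable is state-dependent and learned with a switching-type loss, neither of which is reproduced here.

```python
import math


def constraint_residual(entropy: float, target_entropy: float, slack: float = 0.0) -> float:
    """Residual of the equality constraint: entropy - slack - target_entropy.

    With slack = 0 this is the residual assumed to drive standard temperature tuning.
    A non-negative slack turns "entropy >= target" into "entropy - slack == target".
    """
    if slack < 0.0:
        raise ValueError("slack must be non-negative")
    return entropy - slack - target_entropy


def temperature_step(log_alpha: float, entropy: float, target_entropy: float,
                     slack: float = 0.0, lr: float = 0.1) -> float:
    """One gradient-descent step on J(alpha) = alpha * residual, taken w.r.t. log(alpha)."""
    alpha = math.exp(log_alpha)
    grad = alpha * constraint_residual(entropy, target_entropy, slack)  # dJ/d(log alpha)
    return log_alpha - lr * grad


if __name__ == "__main__":
    target = -1.0  # hypothetical lower bound on the policy entropy

    # Entropy above the bound, no slack: the temperature is pushed down,
    # so the entropy bonus weakens until the entropy falls back to the bound.
    print(temperature_step(0.0, entropy=0.5, target_entropy=target))

    # Entropy below the bound: the temperature is pushed up.
    print(temperature_step(0.0, entropy=-2.0, target_entropy=target))

    # A slack equal to the surplus (0.5 - (-1.0) = 1.5) zeroes the residual,
    # so the temperature is left unchanged even though entropy exceeds the bound.
    print(temperature_step(0.0, entropy=0.5, target_entropy=target, slack=1.5))
```

Under these assumptions, the update without slack only rests when the entropy equals its bound, which is the equality behavior the abstract describes. A slack that absorbs the surplus lets the entropy stay above the bound without the temperature being driven down.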