Symphony: A Heuristic Normalized Calibrated Advantage Actor and Critic Algorithm in application for Humanoid Robots

In our work we implicitly suggest that it is a misconception to think that humans learn fast. The learning process takes time. Babies start learning to move in the restricted fluid environment of the womb. Children are often limited by an underdeveloped body. Even adults are not allowed to participate in complex competitions right away. With robots, however, when learning from scratch, we often do not have the privilege of waiting for tens of millions of steps. "Swaddling" regularization restrains an agent during rapid but unstable development by penalizing action strength in a specific way that does not affect the actions directly. Symphony, a Transitional-policy Deterministic Actor and Critic algorithm, is a concise combination of different ideas aimed at making it possible to train humanoid robots from scratch with Sample Efficiency, Sample Proximity and Safety of Actions in mind. It is well known that a continuous increase in Gaussian noise without appropriate smoothing is harmful to motors and gearboxes. In contrast to stochastic algorithms, we set limited parametric noise and promote reduced action strength, which safely increases entropy because the actions are submerged in weaker noise. When actions require more extreme values, they rise above the weak noise. Empirically, training becomes much safer for both the surrounding environment and the robot's mechanisms. We use a Fading Replay Buffer: a fixed formula containing the hyperbolic tangent adjusts the batch sampling probability, so that the memory contains a recent memory and a long-term memory trail. The Fading Replay Buffer allows us to use Temporal Advantage, where we improve the current Critic Network prediction relative to the exponential moving average. Temporal Advantage allows us to update the Actor and Critic in one pass, to combine the Actor and Critic in one object, and to implement their losses in one line.
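To make the Fading Replay Buffer idea more concrete, here is a minimal sketch of sampling with a hyperbolic-tangent weighting. The abstract does not give the actual formula, so the weighting `tanh(sharpness * position**2)`, the `sharpness` parameter, and the function names `fading_probabilities` and `sample_batch_indices` are all assumptions chosen for illustration. The sketch only shows the general shape: recent transitions are sampled often, while older ones keep a small, slowly fading probability.

```python
import numpy as np


def fading_probabilities(size: int, sharpness: float = 3.0) -> np.ndarray:
    """Sampling probabilities for a buffer holding `size` transitions.

    Index 0 is the oldest transition, index size - 1 the newest.
    ASSUMPTION: this tanh weighting is illustrative, not the paper's formula.
    """
    # Relative position in (0, 1]: close to 0 for old data, 1 for the newest.
    position = np.arange(1, size + 1) / size
    # tanh keeps every weight positive and bounded: old samples get small
    # weights (long-term trail), recent samples approach the maximum weight.
    weights = np.tanh(sharpness * position**2)
    # Normalize so the weights form a probability distribution.
    return weights / weights.sum()


def sample_batch_indices(
    size: int, batch_size: int, rng: np.random.Generator
) -> np.ndarray:
    """Draw buffer indices for one training batch, favoring recent samples."""
    probs = fading_probabilities(size)
    return rng.choice(size, size=batch_size, p=probs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    indices = sample_batch_indices(size=10_000, batch_size=256, rng=rng)
    # The mean index should sit well above the midpoint of the buffer,
    # since recent transitions are drawn more often than old ones.
    print("mean sampled index:", indices.mean())
```

Because the probabilities depend only on the buffer size, they can be precomputed once per size instead of tracking per-sample priorities, which may be what a "fixed formula" refers to; that reading is likewise an assumption.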
