Symphony: A Heuristic Normalized Calibrated Advantage Actor and Critic Algorithm in application for Humanoid Robots

In our work we implicitly suggest that it is a misconception to think that humans learn fast. The learning process takes time. Babies start learning to move in the restricted fluid environment of the womb. Children are often limited by an underdeveloped body. Even adults are not allowed to participate in complex competitions right away. With robots, however, when learning from scratch, we often do not have the privilege of waiting for tens of millions of steps. "Swaddling" regularization restrains an agent during rapid but unstable development by penalizing action strength in a specific way that does not affect the actions directly. Symphony, a Transitional-policy Deterministic Actor and Critic algorithm, is a concise combination of ideas for training humanoid robots from scratch with Sample Efficiency, Sample Proximity and Safety of Actions in mind. It is well known that a continuous increase in Gaussian noise without appropriate smoothing is harmful to motors and gearboxes. In contrast to stochastic algorithms, we set limited parametric noise and promote reduced action strength, which safely increases entropy because the actions are submerged in weaker noise. When more extreme action values are required, the actions rise above the weak noise. Empirically, training becomes much safer for both the surrounding environment and the robot's mechanisms. We use a Fading Replay Buffer: a fixed formula containing the hyperbolic tangent adjusts the batch sampling probability, so that the buffer holds both a recent memory and a long-term memory trail. The Fading Replay Buffer allows us to use Temporal Advantage, the improvement of the current Critic Network prediction relative to its exponential moving average. Temporal Advantage allows us to update the Actor and Critic in one pass, to combine the Actor and Critic in one object, and to implement their losses in one line.
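The abstract does not give the formula used by the Fading Replay Buffer, nor the exact definition of Temporal Advantage. The sketch below is therefore only an illustration of the two ideas under stated assumptions. The tanh ramp in `fading_probabilities`, the parameters `sharpness`, `floor` and `decay`, and all function names are hypothetical and are not taken from the Symphony implementation.

```python
import math
import random


def fading_probabilities(size, sharpness=3.0, floor=0.05):
    """Sampling probabilities for a buffer of `size` transitions, oldest first.

    Assumed weighting (not the paper's formula): a tanh ramp over the
    normalized position in the buffer, so recent transitions receive most
    of the weight while old ones keep a small nonzero share (the
    "long-term memory trail").
    """
    if size <= 0:
        raise ValueError("size must be positive")
    weights = []
    for i in range(size):
        position = (i + 1) / size  # in (0, 1], newest transition = 1
        weights.append(floor + math.tanh(sharpness * position ** 2))
    total = sum(weights)
    return [w / total for w in weights]


def sample_indices(size, batch_size, rng=random):
    """Draw a batch of buffer indices (with replacement) using the fading weights."""
    probs = fading_probabilities(size)
    return rng.choices(range(size), weights=probs, k=batch_size)


def temporal_advantage(q_value, q_ema, decay=0.9):
    """Return (advantage, updated_ema) for one critic prediction.

    Assumed definition: the advantage is the current critic prediction
    minus its exponential moving average; the average is then updated
    with the new prediction.
    """
    advantage = q_value - q_ema
    updated_ema = decay * q_ema + (1.0 - decay) * q_value
    return advantage, updated_ema
```

With these assumed weights, `sample_indices(1000, 64)` returns mostly recent indices with occasional old ones. A positive value from `temporal_advantage` indicates that the current prediction exceeds its running average.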
