Reinforcement learning provides a framework for learning control policies that reproduce diverse motions for simulated characters. However, such policies often exploit unnatural high-frequency signals that are unachievable by humans or physical robots, making them poor representations of real-world behaviors. Existing work addresses this issue by adding a reward term that penalizes large changes in actions over time, but this term often requires substantial tuning effort. We propose the action Jacobian penalty, which directly penalizes changes in actions with respect to changes in the simulated state through automatic differentiation. This effectively eliminates unrealistic high-frequency control signals without task-specific tuning. While effective, the action Jacobian penalty introduces significant computational overhead when used with traditional fully connected neural network architectures. To mitigate this, we introduce a new architecture, the Linear Policy Net (LPN), which significantly reduces the computational burden of calculating the action Jacobian penalty during training. In addition, an LPN requires no parameter tuning, converges faster than baseline methods, and can be queried more efficiently at inference time than a fully connected neural network. We demonstrate that a Linear Policy Net, combined with the action Jacobian penalty, learns policies that generate smooth signals while solving a range of motion imitation tasks with different characteristics, including dynamic motions such as a backflip and various challenging parkour skills. Finally, we apply this approach to create policies for dynamic motions on a physical quadrupedal robot equipped with an arm.
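The core idea, penalizing the Jacobian of the policy's action output with respect to its state input via automatic differentiation, can be sketched as follows. This is a minimal illustration in JAX; the toy MLP policy, its dimensions, and the squared-Frobenius-norm form of the penalty are assumptions for illustration, not the paper's exact architecture or loss.

```python
import jax
import jax.numpy as jnp

# Hypothetical toy policy: a small MLP mapping state -> action.
# (The paper's Linear Policy Net is a different architecture; this
# MLP only illustrates how the penalty is computed via autodiff.)
def policy(params, state):
    h = jnp.tanh(params["W1"] @ state + params["b1"])
    return params["W2"] @ h + params["b2"]

def action_jacobian_penalty(params, state):
    # d(action)/d(state), obtained directly by automatic differentiation.
    J = jax.jacobian(policy, argnums=1)(params, state)
    # Assumed penalty form: squared Frobenius norm of the Jacobian,
    # which discourages actions that change sharply with small state changes.
    return jnp.sum(J ** 2)

# Toy dimensions and random parameters, purely for demonstration.
state_dim, hidden_dim, action_dim = 4, 8, 2
k1, k2 = jax.random.split(jax.random.PRNGKey(0))
params = {
    "W1": jax.random.normal(k1, (hidden_dim, state_dim)) * 0.1,
    "b1": jnp.zeros(hidden_dim),
    "W2": jax.random.normal(k2, (action_dim, hidden_dim)) * 0.1,
    "b2": jnp.zeros(action_dim),
}
state = jnp.ones(state_dim)
penalty = action_jacobian_penalty(params, state)
```

In a training loop, this scalar would be added (with some weight) to the policy's loss; the overhead the abstract mentions comes from evaluating the full Jacobian of a fully connected network at every training step.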