Propagating state distributions through a generic, uncertain nonlinear dynamical model is known to be intractable and usually requires numerical or analytical approximations. We introduce a method for state prediction, called the Expansion-Compression Unscented Transform, and use it to solve a class of online policy optimization problems. The proposed algorithm propagates a finite number of sigma points through a state-dependent distribution. Because each sigma point gives rise to its own distribution over next states, representing the result requires more sigma points at each time step; we call this the expansion operation. To keep the algorithm scalable, we follow each expansion with a compression operation based on moment matching, which keeps the number of sigma points constant across predictions over multiple time steps. The method's prediction performance is empirically shown to be comparable to Monte Carlo at a much lower computational cost. Under state and control input constraints, the state prediction is then used in tandem with a proposed variant of constrained gradient descent to update policy parameters online in a receding-horizon fashion. The framework is implemented as a differentiable computational graph for policy training. We showcase the framework on a quadrotor stabilization task as part of a benchmark comparison in safe-control-gym, and on optimizing the parameters of a Control Barrier Function-based controller in a leader-follower problem.
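The expansion and compression operations described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it rests on several assumptions: the next state given a sigma point is summarized by a conditional mean and covariance, the compression matches only the first two moments using a standard symmetric sigma-point set, and the names `sigma_points`, `expand`, `compress`, `toy_mean`, and `toy_cov` as well as the toy pendulum-like dynamics are hypothetical.

```python
import numpy as np

def sigma_points(mean, cov, kappa=1.0):
    """Symmetric sigma points (2n+1 of them) matching a given mean and covariance."""
    n = mean.shape[0]
    # Columns of L are the offsets; tiny jitter keeps the factorization stable.
    L = np.linalg.cholesky((n + kappa) * cov + 1e-12 * np.eye(n))
    pts = np.vstack([mean, mean + L.T, mean - L.T])
    w = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    w[0] = kappa / (n + kappa)
    return pts, w

def moments(pts, w):
    """Weighted mean and covariance of a sigma-point set."""
    mean = w @ pts
    diff = pts - mean
    return mean, (w[:, None] * diff).T @ diff

def expand(pts, w, cond_mean, cond_cov):
    """Expansion: each parent point spawns 2n+1 children from its own
    state-dependent next-state distribution, so N points become N*(2n+1)."""
    child_pts, child_w = [], []
    for x, wx in zip(pts, w):
        p, v = sigma_points(cond_mean(x), cond_cov(x))
        child_pts.append(p)
        child_w.append(wx * v)  # child weight = parent weight * conditional weight
    return np.vstack(child_pts), np.concatenate(child_w)

def compress(pts, w):
    """Compression (assumed here: first two moments only): replace the
    expanded set by 2n+1 points with the same mean and covariance."""
    mean, cov = moments(pts, w)
    return sigma_points(mean, cov)

# Hypothetical toy model: pendulum-like drift with state-dependent noise.
def toy_mean(x, dt=0.1):
    return x + dt * np.array([x[1], -np.sin(x[0])])

def toy_cov(x):
    return 0.01 * (1.0 + x[0] ** 2) * np.eye(2)

# Multi-step prediction with a constant number of sigma points.
pts, w = sigma_points(np.array([0.5, 0.0]), 0.05 * np.eye(2))
for _ in range(10):
    pts, w = compress(*expand(pts, w, toy_mean, toy_cov))
pred_mean, pred_cov = moments(pts, w)
```

In this sketch the set grows from 5 to 25 points in each expansion and returns to 5 after compression, so the cost per time step stays fixed regardless of the horizon. Matching higher-order moments in the compression step, as a moment-matching scheme may do, would need a richer sigma-point parameterization than the one assumed here.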