For autonomous spacecraft proximity maneuvers and docking (PMD), we introduce a novel Bayesian actor-critic reinforcement learning algorithm that learns a control policy with a stability guarantee. The PMD task is formulated as a Markov decision process that incorporates the relative dynamics model, the docking-cone constraint, and the cost function. Drawing on Lyapunov theory, we frame temporal-difference learning as a constrained Gaussian process regression problem. This approach allows the state-value function to be expressed as a Lyapunov function by leveraging Gaussian processes and deep kernel learning. We develop a novel Bayesian quadrature policy optimization procedure that computes the policy gradient analytically while integrating Lyapunov-based stability constraints; this integration is pivotal to satisfying the rigorous safety demands of spaceflight missions. The proposed algorithm has been experimentally evaluated on a spacecraft air-bearing testbed and shows promising performance.
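To illustrate the core idea of representing a learned value function with a Gaussian process and checking a Lyapunov-style decrease condition, the following is a minimal sketch on a toy one-dimensional stable system. It is not the paper's implementation: the dynamics `x' = a*x`, the RBF kernel, the length scale, and the analytic cost-to-go used as training targets are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, a = 0.95, 0.9  # assumed discount factor and stable closed-loop dynamics x' = a*x

def rbf(X1, X2, ell=1.0):
    """Squared-exponential kernel between two 1-D state arrays."""
    d = X1[:, None] - X2[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

# Training targets: the analytic discounted cost-to-go of the toy system,
# V(x) = sum_t gamma^t (a^t x)^2 = x^2 / (1 - gamma * a^2), plus small noise.
X = rng.uniform(-2.0, 2.0, 30)
y = X**2 / (1 - gamma * a**2) + 0.01 * rng.standard_normal(30)

# GP posterior mean serves as the learned state-value / Lyapunov candidate.
K = rbf(X, X) + 1e-4 * np.eye(30)   # jitter for numerical stability
alpha = np.linalg.solve(K, y)
V = lambda x: rbf(np.atleast_1d(x), X) @ alpha

# Lyapunov-style check: V should decrease along closed-loop trajectories.
x = 1.5
for _ in range(10):
    assert V(a * x)[0] < V(x)[0], "decrease condition violated"
    x = a * x
print("decrease condition holds along the trajectory")
```

In the paper's setting this decrease requirement is not checked after the fact but imposed as a constraint on the Gaussian process regression itself, which is what makes the learned state-value function a valid Lyapunov function.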