In pursuit of autonomous spacecraft proximity maneuvers and docking (PMD), we introduce a novel Bayesian actor-critic reinforcement learning algorithm that learns a control policy with a stability guarantee. The PMD task is formulated as a Markov decision process that reflects the relative dynamic model, the docking cone, and the cost function. Drawing on Lyapunov theory, we frame temporal difference learning as a constrained Gaussian process regression problem. This formulation allows the state-value function to be expressed as a Lyapunov function by leveraging Gaussian processes and deep kernel learning. We develop a novel Bayesian quadrature policy optimization procedure that computes the policy gradient analytically while integrating Lyapunov-based stability constraints. This integration is pivotal in satisfying the rigorous safety demands of spaceflight missions. The proposed algorithm has been experimentally evaluated on a spacecraft air-bearing testbed and shows promising performance.
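To make concrete what it means for a state-value function to act as a Lyapunov function, the conditions below give one common textbook form for a cost-minimizing Markov decision process. This is a sketch under stated assumptions, not necessarily the exact constraint used in the paper: the symbols $V$ (state-value function), $x^{*}$ (target docking state), $\pi$ (policy), and $P$ (transition kernel) are notation introduced here for illustration.

```latex
% Assumed notation (not from the source): V = state-value function,
% x^* = target state, \pi = policy, P = transition kernel of the MDP.
\begin{align}
  V(x^{*}) &= 0, \\
  V(x) &> 0 \quad \forall\, x \neq x^{*}, \\
  \mathbb{E}_{x' \sim P(\cdot \mid x,\, \pi(x))}\!\left[ V(x') \right] - V(x) &< 0
  \quad \forall\, x \neq x^{*}.
\end{align}
```

Under this reading, the first two conditions make $V$ positive definite around the target, and the third requires the value to decrease in expectation along closed-loop trajectories. Constraints of this kind are what a constrained regression for the critic would need to enforce.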