This paper presents a learning-based guidance-and-control approach that couples a reasoning-enabled Large Language Model (LLM) with Group Relative Policy Optimization (GRPO). Controllers for each environment are trained in two stages: Supervised Fine-Tuning (SFT) to learn formatting and control primitives, followed by GRPO for interaction-driven policy improvement. The framework is demonstrated on four control problems spanning a gradient of dynamical complexity, from canonical linear systems through nonlinear oscillatory dynamics to three-dimensional spacecraft attitude control with gyroscopic coupling and thrust constraints. Results show that an LLM with explicit reasoning, optimized via GRPO, can synthesize feasible stabilizing policies under consistent training settings across both linear and nonlinear systems. The two-stage methodology enables models to generate control sequences while providing human-readable explanations of their decision-making process. This work establishes a foundation for applying GRPO-based reasoning to autonomous control systems, with potential applications in aerospace and other safety-critical domains.
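To make the GRPO stage concrete, the sketch below illustrates the group-relative advantage computation that distinguishes GRPO from critic-based policy-gradient methods, applied to a toy control rollout. This is not the paper's implementation: the double-integrator dynamics, reward weights, and randomly drawn candidate control sequences (standing in for sequences decoded by the LLM) are all illustrative assumptions.

```python
# Hypothetical sketch: GRPO-style group-relative advantages on a toy
# double-integrator control task. All dynamics, weights, and sampling
# choices are illustrative, not the paper's actual setup.
import numpy as np

def rollout_reward(u_seq, dt=0.1, x0=(1.0, 0.0), q=1.0, r=0.01):
    """Roll out a candidate control sequence on a double integrator
    (position, velocity) and return the negative quadratic cost."""
    x, v = x0
    cost = 0.0
    for u in u_seq:
        cost += q * (x**2 + v**2) + r * u**2
        x, v = x + dt * v, v + dt * u  # simple Euler step of x'' = u
    return -cost

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO replaces a learned critic with a per-group baseline:
    advantages are rewards standardized within the sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# One GRPO "group": G candidate control sequences for the same prompt
# (drawn at random here; in the paper they would be decoded by the LLM).
rng = np.random.default_rng(0)
G, horizon = 8, 20
candidates = [rng.normal(scale=1.0, size=horizon) for _ in range(G)]
rewards = [rollout_reward(u) for u in candidates]
advantages = group_relative_advantages(rewards)

# Candidates with positive advantage would be reinforced in the policy
# update (token log-probabilities weighted by these advantages, with the
# clipped-ratio and KL-to-reference terms omitted here for brevity).
for i, (rwd, adv) in enumerate(zip(rewards, advantages)):
    print(f"candidate {i}: reward={rwd:8.2f}  advantage={adv:+.2f}")
```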