Trajectory Optimization (TO) and Reinforcement Learning (RL) offer complementary strengths for solving optimal control problems. TO efficiently computes locally optimal solutions but can struggle with non-convexity, while RL is more robust to non-convexity at the cost of significantly higher computational demands. CACTO (Continuous Actor-Critic with Trajectory Optimization) was introduced to combine these advantages by learning a warm-start policy that guides the TO solver towards low-cost trajectories. However, scalability remains a key limitation, as increasing system complexity significantly raises the computational cost of TO. This work introduces CACTO-BIC to address these challenges. CACTO-BIC improves data efficiency by biasing initial-state sampling, leveraging a property of the value function associated with locally optimal policies; moreover, it reduces computation time by exploiting GPU acceleration. Empirical evaluations show improved sample efficiency and faster computation compared to CACTO. Comparisons with PPO show that our approach achieves similar solutions in less time. Finally, experiments on the AlienGO quadruped robot demonstrate that CACTO-BIC scales to high-dimensional systems and is suitable for real-time applications.