CACTO-SL: Using Sobolev Learning to improve Continuous Actor-Critic with Trajectory Optimization

Trajectory Optimization (TO) and Reinforcement Learning (RL) are powerful and complementary tools for solving optimal control problems. On the one hand, TO can efficiently compute locally optimal solutions, but it tends to get stuck in local minima if the problem is not convex. On the other hand, RL is typically less sensitive to non-convexity, but it requires a much higher computational effort. Recently, we proposed CACTO (Continuous Actor-Critic with Trajectory Optimization), an algorithm that uses TO to guide the exploration of an actor-critic RL algorithm. In turn, the policy encoded by the actor is used to warm-start TO, closing the loop between TO and RL. In this work, we present CACTO-SL, an extension of CACTO that exploits the idea of Sobolev learning. To make the training of the critic network faster and more data-efficient, we enrich its training targets with the gradient of the Value function, computed via a backward pass of the differential dynamic programming algorithm. Our results show that the new algorithm is more efficient than the original CACTO, reducing the number of TO episodes by a factor ranging from 3 to 10, and consequently the computation time. Moreover, we show that CACTO-SL helps TO find better minima and produce more consistent results.
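To make the Sobolev-learning idea concrete, the following is a minimal sketch of a critic loss that penalizes errors in both the Value and its gradient with respect to the state. It is not the authors' implementation: the one-hidden-layer tanh critic, the names `critic` and `sobolev_loss`, and the weighting parameter `grad_weight` are all assumptions made for illustration. In CACTO-SL the gradient targets would come from the backward pass of differential dynamic programming; here they are simply passed in as data.

```python
import math

def critic(params, x):
    """Hypothetical one-hidden-layer tanh critic.

    Returns the pair (V(x), dV/dx) for a state vector x given as a list of floats.
    """
    W1, b1, w2, b2 = params
    # Hidden activations h_j = tanh(sum_i W1[j][i] * x[i] + b1[j])
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bj)
         for row, bj in zip(W1, b1)]
    value = sum(w * hj for w, hj in zip(w2, h)) + b2
    # Chain rule: dV/dx_i = sum_j w2[j] * (1 - h_j^2) * W1[j][i]
    grad = [sum(w2[j] * (1.0 - h[j] ** 2) * W1[j][i] for j in range(len(h)))
            for i in range(len(x))]
    return value, grad

def sobolev_loss(params, batch, grad_weight=1.0):
    """Mean over the batch of the squared Value error plus grad_weight times
    the squared error on the Value gradient (grad_weight is an assumed knob).

    Each batch item is (state, value_target, gradient_target).
    """
    total = 0.0
    for x, v_target, g_target in batch:
        v, g = critic(params, x)
        value_err = (v - v_target) ** 2
        grad_err = sum((gi - ti) ** 2 for gi, ti in zip(g, g_target))
        total += value_err + grad_weight * grad_err
    return total / len(batch)
```

Setting `grad_weight=0.0` recovers an ordinary value-only regression loss, which makes it easy to compare the two training signals on the same batch. In practice the gradient of the critic with respect to its input would be obtained by automatic differentiation rather than written by hand.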

