This paper introduces a model-based approach for training feedback controllers for an autonomous agent operating in a highly nonlinear environment. We require the trained policy to ensure that the agent satisfies specific task objectives expressed in discrete-time Signal Temporal Logic (DT-STL). One advantage of formalizing a task in a framework such as DT-STL is that it admits quantitative satisfaction semantics: given a trajectory and a DT-STL formula, we can compute the robustness, which can be interpreted as an approximate signed distance between the trajectory and the set of trajectories satisfying the formula. We represent the feedback controller as a feedforward neural network, and we show that learning it is structurally similar to training a recurrent neural network (RNN) whose number of recurrent units is proportional to the temporal horizon of the agent's task objectives. This poses a challenge: RNNs are susceptible to vanishing and exploding gradients, so na\"{i}ve gradient descent-based strategies for long-horizon task objectives suffer from the same problems. To tackle this challenge, we introduce a novel gradient approximation algorithm based on the idea of dropout, or gradient sampling. We further show that the existing smooth semantics for robustness become inefficient for gradient computation as the specification grows complex. To address this, we propose a new smooth semantics for DT-STL that under-approximates the robustness value and scales well for backpropagation over complex specifications. We demonstrate that our control synthesis methodology helps stochastic gradient descent converge with fewer numerical issues, enabling scalable backpropagation over long time horizons and trajectories in high-dimensional state spaces.
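To make the under-approximation property concrete, recall the standard log-sum-exp soft-min commonly used to smooth the $\min$ operator in robustness semantics; we show it here only as a reference point, not as the new semantics proposed in this paper:
\[
\widetilde{\min}_{\beta}(r_1,\dots,r_n) \;=\; -\frac{1}{\beta}\log\sum_{i=1}^{n} e^{-\beta r_i},
\qquad
\min_i r_i \;-\; \frac{\log n}{\beta} \;\le\; \widetilde{\min}_{\beta}(r_1,\dots,r_n) \;\le\; \min_i r_i,
\]
so the soft-min never over-estimates the true minimum, and the approximation gap vanishes as $\beta \to \infty$. A smooth semantics built from such under-approximating operators yields a sound certificate: a positive smooth robustness implies a positive true robustness, and hence satisfaction of the formula.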
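The RNN analogy and the gradient-sampling idea can be sketched in a few lines of PyTorch. This is a minimal sketch under stated assumptions: the plant \texttt{dynamics}, the \texttt{Controller} architecture, and the \texttt{keep\_prob} parameter are hypothetical placeholders, and the detach-based sampling in \texttt{rollout} is one plausible realization of dropout-style gradient sampling, not this paper's exact algorithm.
\begin{verbatim}
import torch
import torch.nn as nn

def dynamics(x, u, dt=0.1):
    """Hypothetical nonlinear plant (illustrative only)."""
    return x + dt * torch.tanh(u)

class Controller(nn.Module):
    """Feedforward NN mapping (state, time) -> control."""
    def __init__(self, state_dim=2, ctrl_dim=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, ctrl_dim))

    def forward(self, x, t):
        t_col = torch.full((x.shape[0], 1), float(t))
        return self.net(torch.cat([x, t_col], dim=-1))

def rollout(policy, x0, K, keep_prob=1.0):
    # Unrolling the closed loop reuses the same weights at every
    # step, so the computation graph matches a K-step RNN; for
    # long K this invites vanishing/exploding gradients.  With
    # keep_prob < 1, states are randomly detached so backprop
    # traverses only a sampled subset of the recurrent links
    # (dropout-style gradient sampling; illustrative only).
    xs, x = [x0], x0
    for t in range(K):
        if keep_prob < 1.0 and torch.rand(()).item() > keep_prob:
            x = x.detach()
        x = dynamics(x, policy(x, t))
        xs.append(x)
    return torch.stack(xs, dim=1)  # (batch, K+1, state_dim)

policy = Controller()
traj = rollout(policy, torch.zeros(8, 2), K=200, keep_prob=0.25)
loss = -traj[:, -1, 0].mean()  # stand-in for a smooth robustness loss
loss.backward()
\end{verbatim}
In a full pipeline, the placeholder loss above would be replaced by the negated smooth DT-STL robustness of \texttt{traj}, so that gradient descent pushes the trajectory toward satisfying the specification.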