High-altitude balloons have proved useful for ecological aerial surveys, atmospheric monitoring, and communication relays. However, weight and power constraints create a need to investigate alternative modes of propulsion for navigating the stratosphere. Recently, reinforcement learning has been proposed as a control scheme that keeps a balloon in the region of a fixed location by exploiting the diverse, opposing wind fields found at different altitudes. Although station keeping with air pumps has been explored, there is no research on the control problem for balloons actuated by venting and ballasting, which are commonly used as a low-cost alternative. We show how reinforcement learning can be used for this type of balloon. Specifically, we use the soft actor-critic algorithm, which on average station-keeps within 50\;km for 25\% of the flight, consistent with the state of the art. We also show that the proposed controller effectively minimises the consumption of resources, thereby supporting long-duration flights. We frame the controller as a continuous-control reinforcement learning problem, which allows a more diverse range of trajectories than current state-of-the-art work, which uses discrete action spaces. Continuous control also lets us make use of larger ascent rates, which are not possible with air pumps. The desired ascent rate is decoupled into a desired altitude and a time factor to provide a more transparent policy than the low-level control commands used in previous works. Finally, by applying the equations of motion, we establish thresholds on venting and ballasting that keep actions physically feasible and prevent the agent from exploiting the environment.
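One way to picture the decoupled action described above is the minimal sketch below. It assumes the policy emits two values in [-1, 1] that are scaled to a target altitude and a time factor, and that the ascent-rate setpoint is the altitude error divided by the time factor. That relation, the `ActionLimits` bounds, and every name in the block are hypothetical illustrations, not the authors' implementation. In particular, the fixed rate limit stands in for thresholds that the abstract derives from the equations of motion.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ActionLimits:
    # Illustrative placeholder bounds; none of these values come from the paper.
    min_altitude_m: float = 15_000.0
    max_altitude_m: float = 25_000.0
    min_time_factor_s: float = 60.0
    max_time_factor_s: float = 3_600.0
    max_ascent_rate_mps: float = 5.0


def scale(u: float, low: float, high: float) -> float:
    """Map a policy output u in [-1, 1] linearly onto [low, high]."""
    u = max(-1.0, min(1.0, u))
    return low + 0.5 * (u + 1.0) * (high - low)


def desired_ascent_rate(
    action: Sequence[float],
    altitude_m: float,
    limits: Optional[ActionLimits] = None,
) -> float:
    """Turn a 2-D continuous action into a clipped ascent-rate setpoint (m/s).

    Assumed relation: rate = (target altitude - current altitude) / time factor.
    """
    limits = limits or ActionLimits()
    target_m = scale(action[0], limits.min_altitude_m, limits.max_altitude_m)
    time_s = scale(action[1], limits.min_time_factor_s, limits.max_time_factor_s)
    rate = (target_m - altitude_m) / time_s
    # Clip the request to a feasible range (a stand-in for physics-based thresholds).
    bound = limits.max_ascent_rate_mps
    return max(-bound, min(bound, rate))
```

Under these assumptions, a positive setpoint would be met by dropping ballast and a negative one by venting gas. Expressing the action as "reach this altitude over this time" is what makes the policy's intent readable.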
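The station-keeping figure quoted in the abstract (within 50 km for 25% of the flight) can be read as the fraction of trajectory samples that fall inside a fixed radius of the target. The sketch below computes such a fraction. It assumes the track is a list of equally spaced (latitude, longitude) samples in degrees and uses the haversine great-circle distance. The function names and the sampling assumption are illustrative, not the paper's evaluation code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def fraction_within_radius(track, station, radius_km=50.0):
    """Fraction of equally spaced track samples within radius_km of station."""
    if not track:
        raise ValueError("track must contain at least one sample")
    inside = sum(
        1 for lat, lon in track
        if haversine_km(lat, lon, station[0], station[1]) <= radius_km
    )
    return inside / len(track)
```

If the samples were not equally spaced in time, each one would need to be weighted by its time step for the result to represent a share of flight time.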