Behavior Trees (BTs) provide a structured and reactive framework for decision-making, commonly used to switch between sub-controllers based on environmental conditions. Reinforcement Learning (RL), on the other hand, can learn near-optimal controllers but sometimes struggles with sparse rewards, safe exploration, and long-horizon credit assignment. Combining BTs with RL has the potential for mutual benefit: a BT design encodes structured domain knowledge that can simplify RL training, while RL enables automatic learning of the controllers within BTs. However, naive integration of BTs and RL can lead to controllers counteracting one another, possibly undoing previously achieved subgoals and thereby degrading overall performance. To address this, we propose progress constraints, a novel mechanism in which feasibility estimators constrain the allowed action set based on theoretical BT convergence results. Empirical evaluations in a 2D proof-of-concept and a high-fidelity warehouse environment demonstrate improved performance, sample efficiency, and constraint satisfaction compared to prior methods of BT-RL integration.
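To make the core idea concrete, the following is a minimal sketch of how a progress constraint might filter an RL sub-controller's action set. All names here (`progress_constrained_actions`, `feasibility_ok`) are illustrative assumptions, not the paper's actual implementation: a learned feasibility estimator would replace the hand-written predicate.

```python
def progress_constrained_actions(actions, state, achieved_subgoals, feasibility_ok):
    """Keep only actions that the feasibility estimator predicts will not
    undo any already-achieved subgoal; fall back to the full action set if
    the constraint would leave no admissible action.

    Hypothetical sketch of a progress constraint, not the paper's code.
    """
    allowed = [
        a for a in actions
        if all(feasibility_ok(state, a, g) for g in achieved_subgoals)
    ]
    return allowed or list(actions)


# Toy usage: a 1D agent at position 5 has achieved the subgoal "reach x >= 5".
# A simple (hand-written, illustrative) feasibility check predicts whether the
# next state still satisfies the subgoal.
def feasibility_ok(state, action, goal_x):
    return state + action >= goal_x

# Moving left (-1) would drop below x = 5 and undo the subgoal, so only +1 survives.
allowed = progress_constrained_actions([-1, +1], 5, [5], feasibility_ok)
```

In a full BT-RL system, this filter would sit between the RL policy of one BT leaf and the environment, so that learning in one sub-controller cannot regress the subgoals already secured by earlier parts of the tree.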