Logic-Skill Programming: An Optimization-based Approach to Sequential Skill Planning

Recent advances in robot skill learning have made it possible to construct task-agnostic skill libraries, in which multiple simple manipulation primitives (i.e., skills) can be sequenced to tackle significantly more complex tasks. Nevertheless, determining the optimal sequence of independently learned skills remains an open problem, particularly when the objective is given solely as a final geometric configuration rather than as a symbolic goal. To address this challenge, we propose Logic-Skill Programming (LSP), an optimization-based approach that sequences independently learned skills to solve long-horizon tasks. We formulate a first-order extension of a mathematical program that optimizes the cumulative reward of all skills within a plan, abstracted as the sum of their value functions. To solve such programs, we use tensor train factorization to construct the value function space, and alternate between symbolic search and skill value optimization to find a suitable skill skeleton and the optimal subgoal sequence. Experimental results indicate that the obtained value functions approximate cumulative rewards more accurately than state-of-the-art reinforcement learning methods. Furthermore, we validate LSP in three manipulation domains, covering both prehensile and non-prehensile primitives. The results demonstrate its ability to identify the optimal solution over the full logical and geometric path. Real-robot experiments show that our approach copes effectively with contact uncertainty and external disturbances.
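The alternation between symbolic search and skill value optimization can be illustrated with a minimal toy sketch. It is not the authors' implementation: the two 1-D skills, their hand-written value functions, the subgoal grid, and the names `push_value`, `pick_place_value`, `best_subgoals`, and `plan` are all hypothetical, and brute-force enumeration stands in for both the symbolic search and the tensor-train-based optimization.

```python
from itertools import product

# Hypothetical toy skills on a 1-D state. Each value function V(start, goal)
# scores moving an object from `start` to `goal`; -inf marks infeasibility.
def push_value(start, goal):
    # Assumed: pushing only moves forward, at most 2 units, cost 0.5 per unit.
    d = goal - start
    return -0.5 * d if 0 <= d <= 2 else float("-inf")

def pick_place_value(start, goal):
    # Assumed: pick-and-place reaches any goal at a fixed cost plus a small
    # distance-dependent cost.
    return -1.5 - 0.1 * abs(goal - start)

SKILLS = {"push": push_value, "pick_place": pick_place_value}
GRID = [i * 0.5 for i in range(21)]  # candidate subgoals 0.0, 0.5, ..., 10.0

def best_subgoals(skeleton, start, goal):
    """Inner step: for a fixed skill skeleton, pick the intermediate subgoals
    that maximise the sum of skill values (brute force over GRID)."""
    best_value, best_waypoints = float("-inf"), None
    for mids in product(GRID, repeat=len(skeleton) - 1):
        waypoints = [start, *mids, goal]
        total = sum(
            SKILLS[name](a, b)
            for name, a, b in zip(skeleton, waypoints, waypoints[1:])
        )
        if total > best_value:
            best_value, best_waypoints = total, waypoints
    return best_value, best_waypoints

def plan(start, goal, max_len=2):
    """Outer step: enumerate skill skeletons up to max_len and keep the one
    whose optimised subgoal sequence has the highest summed value."""
    best = (float("-inf"), None, None)
    for length in range(1, max_len + 1):
        for skeleton in product(SKILLS, repeat=length):
            value, waypoints = best_subgoals(skeleton, start, goal)
            if value > best[0]:
                best = (value, skeleton, waypoints)
    return best

if __name__ == "__main__":
    value, skeleton, waypoints = plan(0.0, 3.0)
    print(skeleton, waypoints, value)
```

Under these assumed costs, reaching a goal at 3.0 from 0.0 with two consecutive pushes scores higher than a single pick-and-place, so the skeleton and the intermediate subgoal are chosen jointly from the geometric goal alone, without a symbolic goal being specified.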
