Offline reinforcement learning has shown tremendous success in behavioral planning by learning from previously collected demonstrations. However, decision-making in multi-task missions still presents significant challenges. For instance, a mission might require an agent to explore an unknown environment, discover goals, and navigate to them, even if this involves interacting with obstacles along the way. Such behavioral planning problems are difficult to solve because (a) agents fail to adapt beyond the single task learned through their reward function, and (b) agents cannot generalize to new environments not covered in the training demonstrations, e.g., environments with locked doors when every door in the demonstrations was unlocked. Consequently, state-of-the-art decision-making methods are limited to missions where the required tasks are well represented in the training demonstrations and can be solved within a short (temporal) planning horizon. To address this, we propose GenPlan: a stochastic and adaptive planner that leverages discrete-flow models for generative sequence modeling, enabling sample-efficient exploration and exploitation. The framework relies on an iterative denoising procedure to generate a sequence of goals and actions. This approach captures multi-modal action distributions and facilitates goal and task discovery, thereby enhancing generalization to out-of-distribution tasks and environments, i.e., missions not part of the training data. We demonstrate the effectiveness of our method in multiple simulation environments. Notably, GenPlan outperforms state-of-the-art methods by over 10% on adaptive planning tasks, where the agent adapts to multi-task missions while leveraging demonstrations of single-goal-reaching tasks.
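To make the idea of iterative denoising over a discrete sequence concrete, the following is a minimal sketch and not GenPlan's actual model or sampler. It assumes a hypothetical `MASK` noise token, a hand-written `toy_denoiser` standing in for a learned network, and integer tokens standing in for goals or actions. Starting from a fully masked sequence, each step queries the denoiser for per-position categorical distributions and commits a share of the still-masked positions by sampling, so repeated runs can yield different (multi-modal) plans.

```python
import random

MASK = -1  # hypothetical "noise" token marking a position that is not yet decided


def toy_denoiser(seq, vocab_size):
    """Stand-in for a learned model: one categorical distribution per position.

    Assumption for illustration only: if the left neighbour is already decided,
    the next token (cyclically) is preferred; otherwise the distribution is uniform.
    """
    probs = []
    for i in range(len(seq)):
        weights = [1.0] * vocab_size
        if i > 0 and seq[i - 1] != MASK:
            weights[(seq[i - 1] + 1) % vocab_size] += 8.0
        total = sum(weights)
        probs.append([w / total for w in weights])
    return probs


def iterative_denoise(length, vocab_size, steps, rng):
    """Start from an all-MASK sequence and commit some positions at every step."""
    seq = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        probs = toy_denoiser(seq, vocab_size)
        # Commit an equal share of the remaining positions each step, and
        # everything that is left on the final step (ceil division).
        remaining_steps = steps - step
        n_commit = -(-len(masked) // remaining_steps)
        for i in rng.sample(masked, n_commit):
            seq[i] = rng.choices(range(vocab_size), weights=probs[i], k=1)[0]
    return seq


if __name__ == "__main__":
    # Different seeds may produce different sequences from the same start state.
    for seed in range(3):
        print(iterative_denoise(length=8, vocab_size=4, steps=4, rng=random.Random(seed)))
```

In this sketch the number of denoising steps trades off parallelism against how much each sampled token can condition on earlier commitments; a real planner would replace `toy_denoiser` with a trained model conditioned on observations.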