Reinforcement learning (RL) for complex tasks remains a challenge, primarily due to the difficulty of engineering scalar reward functions and the inherent inefficiency of training models from scratch. It is preferable instead to specify complex tasks in terms of elementary subtasks and to reuse subtask solutions whenever possible. In this work, we address continuous-space lexicographic multi-objective RL problems, consisting of prioritized subtasks, which are notoriously difficult to solve. We show that these problems can be scalarized with a subtask transformation and then solved incrementally using value decomposition. Exploiting this insight, we propose prioritized soft Q-decomposition (PSQD), a novel algorithm for learning and adapting subtask solutions under lexicographic priorities in continuous state-action spaces. PSQD can reuse previously learned subtask solutions in a zero-shot composition, followed by an adaptation step. Because it can learn offline from retained subtask training data, adaptation requires no new environment interaction. We demonstrate the efficacy of our approach with successful learning, reuse, and adaptation results for both low- and high-dimensional simulated robot control tasks, as well as offline learning results. In contrast to baseline approaches, PSQD does not trade off between conflicting subtasks or priority constraints, and it satisfies subtask priorities during learning. PSQD provides an intuitive framework for tackling complex RL problems and offers insight into the inner workings of the subtask composition.
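As a schematic illustration of the lexicographic priority structure (the notation below is ours, not taken from the paper): with subtasks $1, \dots, n$ ordered by decreasing priority and per-subtask action-value functions $Q_i$, a lexicographic policy at state $s$ may choose only among actions that remain near-optimal for every higher-priority subtask. One common way to express this is with nested action sets and illustrative slack parameters $\varepsilon_i \ge 0$:

$$
\mathcal{A}_0(s) = \mathcal{A},
\qquad
\mathcal{A}_i(s) = \Big\{ a \in \mathcal{A}_{i-1}(s) \;\Big|\; Q_i(s, a) \ge \max_{a' \in \mathcal{A}_{i-1}(s)} Q_i(s, a') - \varepsilon_i \Big\},
\quad i = 1, \dots, n,
$$

so that the final policy acts within $\mathcal{A}_n(s)$: each lower-priority subtask is optimized only over the action set left indifferent by the higher-priority subtasks, rather than being traded off against them.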