Recent advances in zero-shot reinforcement learning (RL) have made it possible to extract diverse behaviors from unlabeled, offline data sources. In particular, forward-backward (FB) algorithms can recover a family of policies that, given sufficient capacity, approximately solves any standard RL problem (one with additive rewards, i.e., an objective linear in the occupancy measure). While retaining zero-shot properties, we tackle the broader problem class of RL with general utilities, in which the objective is an arbitrary differentiable function of the occupancy measure. This setting is strictly more expressive, capturing tasks such as distribution matching or pure exploration, which cannot, in general, be reduced to additive rewards. We show that this additional complexity can be captured by a novel maximum-entropy (soft) variant of the forward-backward algorithm, which recovers a family of stochastic policies from offline data. When coupled with zero-order search over compact policy embeddings, this algorithm sidesteps iterative optimization schemes and optimizes general utilities directly at test time. Across both didactic and high-dimensional experiments, we demonstrate that our method retains the favorable properties of FB algorithms while extending their reach to more general RL problems.
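For concreteness, here is the standard formalization of this distinction from the general-utilities (convex RL) literature; the notation is ours, not quoted from the paper. Writing $d^\pi$ for the discounted state-action occupancy measure of a policy $\pi$,

$$ d^\pi(s,a) \;=\; (1-\gamma)\sum_{t\ge 0}\gamma^t\,\Pr(s_t=s,\,a_t=a \mid \pi), $$

standard RL maximizes a linear functional of $d^\pi$,

$$ J_r(\pi) \;=\; \langle r, d^\pi\rangle \;=\; \mathbb{E}_{(s,a)\sim d^\pi}\big[r(s,a)\big], $$

whereas RL with general utilities maximizes $J_F(\pi) = F(d^\pi)$ for an arbitrary differentiable functional $F$. For example, pure exploration corresponds to occupancy-entropy maximization, $F(d) = -\sum_{s,a} d(s,a)\log d(s,a)$, and distribution matching to $F(d) = -D_{\mathrm{KL}}(d \,\|\, d^{*})$ for a target occupancy $d^{*}$; neither is linear in $d$, so neither reduces to a fixed additive reward.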
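The abstract does not specify the test-time search procedure, so the following is only a minimal sketch of what zero-order search over a compact policy-embedding space could look like, using the cross-entropy method as one representative gradient-free optimizer. The names `estimate_utility` and `dim`, and the Gaussian search distribution, are illustrative assumptions, not the paper's method.

```python
import numpy as np

def cross_entropy_search(estimate_utility, dim, iters=20, pop=64,
                         elite_frac=0.125, seed=0):
    """Zero-order (gradient-free) search over policy embeddings z.

    Maintains a diagonal Gaussian over embeddings, samples candidates,
    scores each with a utility estimate (hypothetically, an estimate of
    F(d^{pi_z}) from a learned occupancy model), and refits the Gaussian
    to the top-scoring fraction. No gradients of F are required.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        # Sample candidate embeddings from the current search distribution.
        zs = mu + sigma * rng.standard_normal((pop, dim))
        # Score each candidate with the (assumed) utility estimator.
        scores = np.array([estimate_utility(z) for z in zs])
        # Refit the Gaussian to the elite candidates.
        elite = zs[np.argsort(scores)[-n_elite:]]
        mu = elite.mean(axis=0)
        sigma = elite.std(axis=0) + 1e-6  # floor to avoid premature collapse
    return mu  # embedding of the (approximately) best policy in the family

# Toy stand-in utility for illustration only: prefer embeddings near a target.
target = np.array([0.5, -1.0, 2.0])
z_star = cross_entropy_search(lambda z: -np.sum((z - target) ** 2), dim=3)
```

Because the search treats the utility as a black box, the same loop applies unchanged whether the objective is a reward, an occupancy entropy, or a distribution-matching loss, which is what allows optimization "directly at test time" without an iterative policy-learning scheme.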