Offline reinforcement learning (RL) can in principle synthesize improved behavior from a dataset consisting only of suboptimal trials. One way this can happen is by "stitching" together the best parts of otherwise suboptimal trajectories that overlap on similar states, creating new behaviors in which each individual state is in-distribution but the overall return is higher. However, in many interesting and complex applications, such as autonomous navigation and dialogue systems, the state is only partially observed. Even worse, the state representation is unknown or not easy to define. In such settings, policies and value functions are often conditioned on observation histories instead of states, and it is not clear whether the same kind of "stitching" is feasible at the level of observation histories: two different trajectories would always have different histories, so the "similar states" that might lead to effective stitching cannot be leveraged. Theoretically, we show that standard offline RL algorithms conditioned on observation histories suffer from poor sample complexity, in accordance with this intuition. We then identify sufficient conditions under which offline RL can still be efficient: intuitively, it needs to learn a compact representation of the history comprising only features relevant for action selection. We introduce a bisimulation loss that captures the extent to which this happens, and propose that offline RL can explicitly optimize this loss to improve worst-case sample complexity. Empirically, we show that across a variety of tasks, either our proposed loss improves performance, or its value is already minimized as a consequence of standard offline RL, indicating that it correlates well with good performance.
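The "stitching" intuition from the abstract can be illustrated with a toy example. The sketch below is not the paper's method or experimental setup: the two-trajectory dataset, the observation names, and the helpers `to_transitions` and `in_sample_q_iteration` are all hypothetical, and the learner is a plain tabular Q-iteration that only backs up actions seen in the data. It compares conditioning on the current observation with conditioning on the full observation-action history.

```python
from collections import defaultdict

# Hypothetical toy dataset: two trajectories that overlap on observation "M".
# Each step is (obs, action, reward, next_obs, done).
traj_1 = [("X", "go", 0.0, "M", False), ("M", "left", 0.0, "END", True)]
traj_2 = [("Y", "go", 0.0, "M", False), ("M", "right", 1.0, "END", True)]
dataset = [traj_1, traj_2]


def to_transitions(trajectories, use_history):
    """Flatten trajectories, keyed by the current observation or by the full history."""
    transitions = []
    for traj in trajectories:
        prefix = ()  # observations and actions seen so far in this trajectory
        for obs, action, reward, next_obs, done in traj:
            if use_history:
                key = prefix + (obs,)
                next_key = key + (action, next_obs)
                prefix = key + (action,)
            else:
                key, next_key = obs, next_obs
            transitions.append((key, action, reward, next_key, done))
    return transitions


def in_sample_q_iteration(transitions, gamma=1.0, sweeps=10):
    """Tabular Q-iteration whose max ranges only over actions present in the data."""
    q = defaultdict(float)
    seen_actions = defaultdict(set)
    for key, action, _, _, _ in transitions:
        seen_actions[key].add(action)
    for _ in range(sweeps):
        for key, action, reward, next_key, done in transitions:
            if done or not seen_actions[next_key]:
                target = reward
            else:
                best_next = max(q[(next_key, a)] for a in seen_actions[next_key])
                target = reward + gamma * best_next
            q[(key, action)] = target  # deterministic toy problem, so plain assignment
    return q


q_state = in_sample_q_iteration(to_transitions(dataset, use_history=False))
q_hist = in_sample_q_iteration(to_transitions(dataset, use_history=True))

print("observation-conditioned Q(X, go):", q_state[("X", "go")])
print("history-conditioned Q((X,), go):", q_hist[(("X",), "go")])
```

In this toy setting, the observation-conditioned learner assigns value 1.0 to starting at `X`, because it combines the first step of `traj_1` with the rewarding second step of `traj_2`. The history-conditioned learner assigns 0.0, since the history `("X", "go", "M")` appears only once in the data and was followed by the unrewarded action.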