In real-world scenarios, arbitrary interactions with the environment can often be costly, and the actions of expert demonstrations are not always available. To reduce the need for both, Offline Learning from Observations (LfO) has been studied extensively: the agent learns to solve a task with only expert states and \textit{task-agnostic} non-expert state-action pairs. The state-of-the-art DIstribution Correction Estimation (DICE) methods minimize a divergence between the state occupancies of the learner and expert policies. However, they are limited to either $f$-divergences (KL and $\chi^2$) or the Wasserstein distance with Kantorovich-Rubinstein duality, the latter of which constrains the underlying distance metric that is crucial to the performance of Wasserstein-based solutions. To address this problem, we propose Primal Wasserstein DICE (PW-DICE), which minimizes the primal Wasserstein distance between the expert and learner state occupancies with a pessimistic regularizer and uses a contrastively learned distance as the underlying metric for the Wasserstein distance. Theoretically, we prove that our framework generalizes the state-of-the-art method SMODICE and unifies $f$-divergence and Wasserstein minimization. Empirically, we find that PW-DICE improves upon several state-of-the-art methods on multiple testbeds.
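To make the contrast between the primal and dual forms concrete, the following is a minimal sketch of the two textbook formulations, not the paper's exact objective. The notation is an assumption introduced here for illustration: a finite state space $\mathcal{S}$, learner and expert state occupancies $d^\pi$ and $d^E$, a ground cost $c(s, s')$, and a transport plan $\Pi$. The primal Wasserstein distance is the linear program
\[
W_c(d^\pi, d^E) \;=\; \min_{\Pi \ge 0} \sum_{s, s' \in \mathcal{S}} \Pi(s, s')\, c(s, s')
\quad \text{s.t.} \quad \sum_{s'} \Pi(s, s') = d^\pi(s), \qquad \sum_{s} \Pi(s, s') = d^E(s'),
\]
which is well defined for any nonnegative cost $c$, including a learned one. The Kantorovich-Rubinstein dual form,
\[
W_1(d^\pi, d^E) \;=\; \sup_{\|f\|_{L} \le 1} \; \mathbb{E}_{s \sim d^\pi}[f(s)] - \mathbb{E}_{s \sim d^E}[f(s)],
\]
holds only when $c$ is a metric, and the constraint $\|f\|_{L} \le 1$ requires $f$ to be 1-Lipschitz with respect to that same metric. Fixing how the Lipschitz constraint is enforced therefore fixes the metric as well, which is the restriction the primal form avoids.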