Gradient-based methods enable efficient search in high-dimensional spaces. However, applying them effectively in offline optimization paradigms such as offline Reinforcement Learning (RL) or Imitation Learning (IL) requires more careful consideration of how uncertainty estimation interacts with the first-order methods that attempt to minimize it. We study smoothed distance to data as an uncertainty metric, and claim that it has two beneficial properties: (i) it allows gradient-based methods that attempt to minimize uncertainty to drive iterates to data as smoothing is annealed, and (ii) it facilitates analysis of model bias with Lipschitz constants. As distance to data can be expensive to compute online, we consider settings where we need to amortize this computation. Instead of learning the distance, however, we propose to learn its gradients directly as an oracle for first-order optimizers. We show that these gradients can be efficiently learned with score-matching techniques by leveraging the equivalence between distance to data and data likelihood. Using this insight, we propose Score-Guided Planning (SGP), a planning algorithm for offline RL that utilizes score-matching to enable first-order planning in high-dimensional problems, where zeroth-order methods were unable to scale and ensembles were unable to overcome local minima. Website: https://sites.google.com/view/score-guided-planning/home
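To make the link between smoothed distance to data and data likelihood concrete, the following is a minimal sketch under stated assumptions, not the implementation used in SGP. It assumes one common softmin form of the smoothed distance, d_sigma(x) = -sigma^2 log mean_i exp(-||x - x_i||^2 / (2 sigma^2)). Up to an additive constant this is -sigma^2 times the log-density of the data perturbed by Gaussian noise of scale sigma, so its gradient equals -sigma^2 times the score of that density. The function names, the toy dataset, the step size, and the annealing schedule are all hypothetical. The closed-form gradient over a small dataset stands in for the learned score model that the abstract proposes as an amortized oracle.

```python
import numpy as np

def smoothed_distance(x, data, sigma):
    """Assumed softmin form: -sigma^2 * log mean_i exp(-||x - x_i||^2 / (2 sigma^2))."""
    sq_dists = np.sum((data - x) ** 2, axis=1)
    logits = -sq_dists / (2.0 * sigma ** 2)
    m = logits.max()  # shift for a numerically stable log-mean-exp
    return -sigma ** 2 * (m + np.log(np.mean(np.exp(logits - m))))

def smoothed_distance_grad(x, data, sigma):
    """Gradient of smoothed_distance; equals -sigma^2 times the score of the
    Gaussian-perturbed empirical data distribution."""
    sq_dists = np.sum((data - x) ** 2, axis=1)
    logits = -sq_dists / (2.0 * sigma ** 2)
    logits -= logits.max()  # stable softmax
    weights = np.exp(logits)
    weights /= weights.sum()
    # x minus the softmax-weighted average of the data points
    return x - weights @ data

def descend_to_data(x0, data, sigmas, step=0.5, iters_per_sigma=50):
    """Gradient descent on the smoothed distance while annealing sigma
    through the given schedule (largest to smallest)."""
    x = np.asarray(x0, dtype=float).copy()
    for sigma in sigmas:
        for _ in range(iters_per_sigma):
            x = x - step * smoothed_distance_grad(x, data, sigma)
    return x

# Toy dataset and annealing schedule (hypothetical values for illustration).
data = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
x_final = descend_to_data([3.0, 3.0], data, sigmas=[2.0, 1.0, 0.5, 0.1, 0.02])
nearest_gap = np.min(np.linalg.norm(data - x_final, axis=1))
print("final iterate:", x_final, "distance to nearest data point:", nearest_gap)
```

With a large sigma the gradient pulls the iterate toward a broad average of the dataset, and as sigma shrinks the softmax weights concentrate on the nearest points. This is the behavior that property (i) relies on. In a high-dimensional setting the exact sum over the dataset would be replaced by a score model trained with score matching, which this sketch does not attempt.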