In this work we theoretically show that conservative objective models (COMs) for offline model-based optimisation (MBO) are a special kind of contrastive divergence-based energy model, one where the energy function represents both the unconditional probability of the input and the conditional probability of the reward variable. While the initial formulation only samples modes from its learned distribution, we propose a simple fix that replaces its gradient ascent sampler with a Langevin MCMC sampler. This gives rise to a special probabilistic model where the probability of sampling an input is proportional to its predicted reward. Lastly, we show that better samples can be obtained if the model is decoupled so that the unconditional and conditional probabilities are modelled separately.
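To illustrate the sampler change described above, the following is a minimal sketch, not the authors' implementation. It assumes a toy one-dimensional unnormalised log-density, log p(x) = -0.5 (x - 2)^2, as a hypothetical stand-in for a learned negative energy. The names `grad_log_p`, `gradient_ascent` and `langevin`, as well as the step size and number of steps, are illustrative assumptions. The sketch contrasts plain gradient ascent, which drives every particle to the mode, with an unadjusted Langevin update, which adds Gaussian noise at each step so that the particles approximately follow the density itself.

```python
import numpy as np


def grad_log_p(x):
    # Gradient of the toy log-density log p(x) = -0.5 * (x - 2)^2.
    # Hypothetical stand-in for the gradient of a learned negative energy.
    return -(x - 2.0)


def gradient_ascent(x0, step=0.1, n_steps=500):
    # Deterministic ascent on log p: every particle moves towards the mode.
    x = x0.copy()
    for _ in range(n_steps):
        x = x + step * grad_log_p(x)
    return x


def langevin(x0, step=0.1, n_steps=500, seed=0):
    # Unadjusted Langevin dynamics:
    #   x <- x + (step / 2) * grad log p(x) + sqrt(step) * N(0, 1)
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + 0.5 * step * grad_log_p(x) + np.sqrt(step) * noise
    return x


# Start both samplers from the same uniformly spread particles.
x0 = np.random.default_rng(1).uniform(-5.0, 5.0, size=2000)
ga = gradient_ascent(x0)
ld = langevin(x0)

print("gradient ascent: mean, std =", ga.mean(), ga.std())
print("Langevin:        mean, std =", ld.mean(), ld.std())
```

In this toy setting the gradient ascent particles collapse onto the single mode at 2, whereas the Langevin particles retain a spread close to that of the target density. The unadjusted update carries a small discretisation bias that shrinks with the step size.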