This paper proposes a synergy of amortised and particle-based methods for sampling from distributions defined by unnormalised density functions. We establish a connection between sequential Monte Carlo (SMC) and neural sequential samplers trained by maximum-entropy reinforcement learning (MaxEnt RL), wherein learnt sampling policies and value functions define proposal kernels and twist functions. Exploiting this connection, we introduce an off-policy RL training procedure for the sampler in which SMC, with the learnt sampler as its proposal, serves as a behaviour policy that better explores the target distribution. We describe techniques for stable joint training of proposals and twist functions, as well as an adaptive weight-tempering scheme that reduces the variance of the training signal. Furthermore, building upon past attempts to use experience replay to guide the training of neural samplers, we derive a way to combine historical samples with annealed importance sampling weights within a replay buffer. On synthetic multi-modal targets (in both continuous and discrete spaces) and on the Boltzmann distribution of alanine dipeptide conformations, we demonstrate improvements in approximating the true distribution and in training stability compared to both amortised and Monte Carlo methods.