This paper proposes a synergy of amortised and particle-based methods for sampling from distributions defined by unnormalised density functions. We establish a connection between sequential Monte Carlo (SMC) and neural sequential samplers trained by maximum-entropy reinforcement learning (MaxEnt RL), wherein learnt sampling policies and value functions define proposal kernels and twist functions, respectively. Exploiting this connection, we introduce an off-policy RL training procedure for the sampler in which samples from SMC (run with the learnt sampler as its proposal) serve as a behaviour policy that better explores the target distribution. We describe techniques for stable joint training of proposals and twist functions, and an adaptive weight tempering scheme that reduces the variance of the training signal. Furthermore, building upon past attempts to use experience replay to guide the training of neural samplers, we derive a way to combine historical samples with annealed importance sampling weights within a replay buffer. On synthetic multi-modal targets (in both continuous and discrete spaces) and the Boltzmann distribution of alanine dipeptide conformations, we demonstrate improvements in both the approximation of the true distribution and training stability compared to purely amortised and Monte Carlo methods.
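To make the idea of adaptive weight tempering concrete, the following is a minimal sketch and not the paper's scheme, whose details the abstract does not give. It assumes one common heuristic: raise the importance weights to a power `beta` in [0, 1], chosen by bisection so that the effective sample size (ESS) stays above a target fraction of the particle count, and then resample systematically. The function names, `target_frac`, and the bisection depth are all hypothetical choices made for illustration.

```python
import numpy as np


def ess(log_w):
    """Effective sample size of the self-normalised weights exp(log_w)."""
    w = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
    return w.sum() ** 2 / (w ** 2).sum()


def temper_log_weights(log_w, target_frac=0.5, iters=50):
    """Assumed heuristic: largest beta in [0, 1] (up to bisection accuracy)
    such that ESS(beta * log_w) >= target_frac * N."""
    n = len(log_w)
    if ess(log_w) >= target_frac * n:
        return 1.0, log_w  # weights are already well behaved
    lo, hi = 0.0, 1.0  # beta = 0 gives uniform weights, so ESS = N
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ess(mid * log_w) >= target_frac * n:
            lo = mid  # constraint holds, try a larger exponent
        else:
            hi = mid
    return lo, lo * log_w


def systematic_resample(log_w, rng):
    """Systematic resampling: returns ancestor indices for N particles."""
    n = len(log_w)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    positions = (rng.random() + np.arange(n)) / n
    idx = np.searchsorted(np.cumsum(w), positions)
    return np.minimum(idx, n - 1)  # guard against round-off in the cumsum


rng = np.random.default_rng(0)
raw_log_w = rng.normal(scale=5.0, size=1000)  # synthetic high-variance log-weights
beta, tempered = temper_log_weights(raw_log_w)
ancestors = systematic_resample(tempered, rng)
```

Tempering trades bias for variance: a smaller `beta` flattens the weights, so fewer particles dominate the resampling step and any training signal built from them, at the cost of no longer targeting the original weights exactly.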