Generative flow networks (GFlowNets) are amortized variational inference algorithms that treat sampling from a distribution over compositional objects as a sequential decision-making problem with a learnable action policy. Unlike other algorithms for hierarchical sampling that optimize a variational bound, GFlowNet algorithms can stably run off-policy, which can be advantageous for discovering modes of the target distribution. Despite this flexibility in the choice of behaviour policy, the optimal way of efficiently selecting trajectories for training has not yet been systematically explored. In this paper, we view the choice of trajectories for training as an active learning problem and approach it using Bayesian techniques inspired by methods for multi-armed bandits. The proposed algorithm, Thompson sampling GFlowNets (TS-GFN), maintains an approximate posterior distribution over policies and samples trajectories from this posterior for training. We show in two domains that TS-GFN yields improved exploration and thus faster convergence to the target distribution than the off-policy exploration strategies used in past work.
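The bandit technique that motivates this approach can be illustrated on a toy problem. The sketch below is classical Beta-Bernoulli Thompson sampling, not TS-GFN itself: the function name `thompson_sampling`, the arms, and their reward probabilities are hypothetical and chosen only for illustration. Whereas this example keeps a posterior over per-arm success rates, TS-GFN as described above keeps an approximate posterior over policies and samples training trajectories from it.

```python
import random


def thompson_sampling(true_probs, n_rounds, seed=0):
    """Beta-Bernoulli Thompson sampling on a toy bandit (illustrative only)."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    # Beta(1, 1) (uniform) prior on each arm's success probability.
    alpha = [1.0] * n_arms
    beta = [1.0] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Draw one plausible success rate per arm from the current posterior.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        # Act greedily with respect to the sampled rates, not the posterior means.
        arm = max(range(n_arms), key=lambda a: samples[a])
        # Simulate a Bernoulli reward from the (hidden) true probability.
        reward = 1 if rng.random() < true_probs[arm] else 0
        # Conjugate update of the chosen arm's Beta posterior.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls, alpha, beta


if __name__ == "__main__":
    pulls, alpha, beta = thompson_sampling([0.1, 0.5, 0.9], n_rounds=2000)
    print("pulls per arm:", pulls)
```

Because actions are chosen from posterior samples, arms with uncertain estimates still get tried occasionally, while pulls concentrate on the best arm as evidence accumulates. This is the exploration behaviour the abstract appeals to when selecting trajectories for training.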