This paper explores policy gradient algorithms for training stochastic policies to sample from structured discrete probability distributions under the Generative Flow Network (GFlowNet) framework. Building on the extensive theoretical connections between GFlowNets and entropy-regularized reinforcement learning, we derive equivalents of standard policy gradient algorithms for training GFlowNets and experimentally explore several of their methodological aspects, including baseline training and advantage estimation. Most importantly, our work is the first to derive and successfully apply proximal policy optimization to GFlowNets, showing improved convergence speed and data efficiency compared to standard GFlowNet training objectives on benchmarks ranging from synthetic energies to molecular graph generation.