Generative Flow Networks (GFlowNets) are related to Markov chain Monte Carlo (MCMC) methods (as they sample from a distribution specified by an energy function), reinforcement learning (as they learn a policy that samples compositional objects through a sequence of steps), generative models (as they learn to represent and sample from a distribution), and amortized variational methods (as they can be used to learn to approximate and sample from an otherwise intractable posterior, given a prior and a likelihood). They are trained to generate an object $x$ through a sequence of steps with probability proportional to some reward function $R(x)$ (or $\exp(-\mathcal{E}(x))$, with $\mathcal{E}(x)$ denoting the energy function), which is given at the end of the generative trajectory. As in other RL settings where the reward is only given at the end, the efficiency of training and credit assignment can suffer when trajectories are long. In previous GFlowNet work, no learning was possible from incomplete trajectories, that is, trajectories that lack a terminal state and therefore the associated reward. In this paper, we consider the case where the energy function can be applied not just to terminal states but also to intermediate states. This is the case, for example, when the energy function is additive, with terms available along the trajectory. We show how to reparameterize the GFlowNet state flow function to take advantage of the partial reward already accrued at each state. This enables a training objective that can update parameters even from incomplete trajectories. Even when complete trajectories are available, obtaining more localized credit and gradients is found to speed up training convergence, as demonstrated across many simulations.
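To make the reparameterization concrete, the following is a minimal sketch of one way it could look. The abstract does not state the exact form, so everything here is an assumption: that the state flow is factored as $F(s) = \exp(-\mathcal{E}(s))\,\tilde{F}(s)$, that training uses a detailed-balance-style constraint $F(s)P_F(s'|s) = F(s')P_B(s|s')$ on a single transition $s \to s'$, and that all function names and numbers are hypothetical. Under these assumptions the squared residual depends only on the energies at $s$ and $s'$, so it can be evaluated without reaching a terminal state.

```python
import math


def db_residual(log_f_s, log_pf, log_f_next, log_pb):
    """Detailed-balance residual in log space for one transition s -> s'."""
    return log_f_s + log_pf - log_f_next - log_pb


def fl_residual(log_ft_s, log_pf, log_ft_next, log_pb, energy_s, energy_next):
    """Same residual after substituting log F(s) = log F~(s) - E(s) (assumed form)."""
    return log_ft_s + log_pf - log_ft_next - log_pb + (energy_next - energy_s)


# Hypothetical values for a single intermediate transition s -> s'.
energy_s, energy_next = 0.7, 1.9      # energies evaluated at intermediate states
log_ft_s, log_ft_next = 0.3, -0.4     # learned log F~ at s and s'
log_pf, log_pb = math.log(0.25), math.log(0.5)  # forward / backward policy terms

r_fl = fl_residual(log_ft_s, log_pf, log_ft_next, log_pb, energy_s, energy_next)

# Consistency check: identical to plain detailed balance on F = exp(-E) * F~.
r_db = db_residual(log_ft_s - energy_s, log_pf, log_ft_next - energy_next, log_pb)

# Per-transition loss; no terminal reward is needed to compute it.
loss = r_fl ** 2
```

In this sketch the energy difference `energy_next - energy_s` plays the role of the partial reward accrued on the transition, which is what gives each step its own local training signal.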