This paper studies Generative Flow Networks (GFlowNets), which learn to sample objects proportionally to a given reward function through trajectories of state transitions. In this work, we observe that GFlowNets tend to under-exploit high-reward objects when trained on an insufficient number of trajectories, which can leave a large gap between the estimated flow and the (known) reward value. In response to this challenge, we propose a pessimistic backward policy for GFlowNets (PBP-GFN), which maximizes the observed flow so that it aligns closely with the true reward of each object. We extensively evaluate PBP-GFN across eight benchmarks, including a hyper-grid environment, bag generation, structured set generation, molecular generation, and four RNA sequence generation tasks. In particular, PBP-GFN enhances the discovery of high-reward objects, maintains the diversity of the generated objects, and consistently outperforms existing methods.
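As a minimal illustrative sketch (using generic GFlowNet notation that we assume here, not necessarily the paper's own), the quantity being aligned can be written as

\[
p(x) \;=\; \sum_{\tau \rightsquigarrow x} \;\prod_{(s \to s') \in \tau} P_F(s' \mid s) \;\propto\; R(x),
\qquad
F_{\mathrm{obs}}(x) \;=\; \sum_{\tau \in \mathcal{D},\, \tau \rightsquigarrow x} \;\prod_{(s \to s') \in \tau} P_F(s' \mid s),
\]

where $P_F$ is the learned forward policy, $\tau \rightsquigarrow x$ ranges over complete trajectories terminating at $x$, and $\mathcal{D}$ denotes the trajectories encountered during training. When $\mathcal{D}$ contains only a few of the trajectories that reach a high-reward object $x$, the observed flow $F_{\mathrm{obs}}(x)$ can fall far short of $R(x)$; under this reading, PBP-GFN chooses the backward policy so that trajectory mass concentrates on observed trajectories, pushing $F_{\mathrm{obs}}(x)$ toward $R(x)$.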