Reinforcement learning fine-tuning has proven effective for steering generative diffusion models toward desired properties in image and molecular domains. Graph diffusion models have similarly been applied to combinatorial structure generation, including neural architecture search (NAS). However, neural architectures are directed acyclic graphs (DAGs) in which edge direction encodes functional semantics such as data flow, information that existing graph diffusion methods, designed for undirected structures, discard. We propose Directed Graph Policy Optimization (DGPO), which extends reinforcement learning fine-tuning of discrete graph diffusion models to DAGs via topological node ordering and positional encoding. Validated on NAS-Bench-101 and NAS-Bench-201, DGPO matches the benchmark optimum on all three NAS-Bench-201 tasks (91.61%, 73.49%, 46.77%). The central finding is that the model learns transferable structural priors: pretrained on only 7% of the search space, it generates near-oracle architectures after fine-tuning, coming within 0.32 percentage points of the full-data model and extrapolating 7.3 percentage points beyond its training ceiling. Bidirectional control experiments confirm genuine reward-driven steering, with inverse optimization reaching near random-chance accuracy (9.5%). These results demonstrate that reinforcement learning-steered discrete diffusion, once extended to handle directionality, provides a controllable generative framework for directed combinatorial structures.
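The abstract's key extension is representing a DAG so that edge direction survives the diffusion process. A minimal sketch of one way this can work: relabel nodes by their topological index so the adjacency matrix becomes strictly upper-triangular, with the index doubling as a positional encoding. All names here are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch: canonicalizing a DAG via topological node ordering,
# in the spirit of the ordering + positional encoding the abstract describes.
from collections import deque

def topological_order(adj):
    """Kahn's algorithm: return node indices in a topological order."""
    n = len(adj)
    indeg = [sum(adj[u][v] for u in range(n)) for v in range(n)]
    queue = deque(v for v in range(n) if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in range(n):
            if adj[u][v]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    queue.append(v)
    assert len(order) == n, "graph contains a cycle"
    return order

def canonicalize(adj):
    """Relabel nodes by topological index so the adjacency matrix is
    strictly upper-triangular; the topological index can then serve as
    a positional encoding for the denoiser."""
    order = topological_order(adj)
    pos = {node: i for i, node in enumerate(order)}
    n = len(adj)
    new_adj = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            if adj[u][v]:
                new_adj[pos[u]][pos[v]] = 1
    return new_adj, pos

# A 4-node DAG with edges 3->0, 3->1, 0->2, 1->2 (e.g. a tiny NAS cell).
adj = [[0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 0],
       [1, 1, 0, 0]]
new_adj, pos = canonicalize(adj)
# After relabeling, every edge points from a lower index to a higher one,
# so directionality is encoded purely in the node ordering.
assert all(new_adj[i][j] == 0 for i in range(4) for j in range(4) if i >= j)
```

Because every valid DAG admits such an ordering, an undirected-style graph diffusion model operating on the upper triangle can, in principle, generate directed structures without any change to its noise process.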