Molecular graph generation (MGG) is essentially a multi-class generative task: it predicts the categories of atoms and bonds under strict chemical and structural constraints. However, many prevailing diffusion paradigms learn to regress numerical embeddings and rely on a hard discretization rule during sampling to recover discrete labels. This creates a mismatch between training and sampling: models are trained for point-wise numerical fidelity, whereas sampling depends only on which side of a categorical decision boundary a value falls. The mismatch forces the model to spend capacity on intra-class variations that become irrelevant after discretization, which compromises diversity, structural statistics, and generalization. We therefore propose TopBF, a unified framework that (i) performs MGG directly in the space of continuous distribution parameters, (ii) learns graph topology through a Quasi-Wasserstein optimal-transport coupling under geodesic costs, and (iii) supports controllable, property-conditioned generation during sampling without retraining the base model. TopBF uses the cumulative distribution function (CDF) to compute the category probabilities induced by the Gaussian channel, thereby aligning the training objective with the discretization applied during sampling. Experiments on QM9 and ZINC250k demonstrate superior structural fidelity and efficient generation.
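To make the CDF idea concrete, the following is a minimal sketch of how a Gaussian over a continuous value can be turned into category probabilities, so that the training loss is defined on the same bins that hard discretization uses at sampling time. It is not the TopBF implementation: the equal-width bins on [-1, 1], the tail handling, and the names `category_probs`, `nll`, and `hard_label` are all assumptions made for illustration, since the abstract does not specify how categories map to intervals.

```python
import math


def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def category_probs(mu, sigma, num_classes):
    """Mass of N(mu, sigma^2) falling in each of num_classes bins.

    Assumption: bins are equal-width intervals on [-1, 1], and the two
    outer bins absorb the Gaussian tails so the probabilities sum to 1.
    """
    width = 2.0 / num_classes
    probs = []
    for k in range(num_classes):
        left = -1.0 + k * width
        right = left + width
        lo = 0.0 if k == 0 else normal_cdf(left, mu, sigma)
        hi = 1.0 if k == num_classes - 1 else normal_cdf(right, mu, sigma)
        probs.append(hi - lo)
    return probs


def nll(mu, sigma, target, num_classes):
    """Categorical negative log-likelihood of the target class."""
    p = category_probs(mu, sigma, num_classes)[target]
    return -math.log(max(p, 1e-12))  # clamp to avoid log(0)


def hard_label(mu, num_classes):
    """Hard discretization: index of the bin that contains mu (clamped)."""
    width = 2.0 / num_classes
    k = int(math.floor((mu + 1.0) / width))
    return min(max(k, 0), num_classes - 1)


if __name__ == "__main__":
    probs = category_probs(mu=0.1, sigma=0.2, num_classes=4)
    print([round(p, 4) for p in probs])
    print(hard_label(0.1, 4), nll(0.1, 0.2, target=2, num_classes=4))
```

Under these assumptions, the loss depends only on how much probability mass lies inside the target bin, not on how close `mu` is to a particular point within it, which is the kind of training and sampling alignment the abstract describes.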