Searching through chemical space is an exceptionally challenging problem because the number of possible molecules grows combinatorially with the number of atoms. Large autoregressive models trained on databases of chemical compounds have yielded powerful generators, but we still lack robust strategies for generating molecules with desired properties. This molecular search problem closely resembles the "alignment" problem for large language models, though for many chemical tasks we have a specific and easily evaluable reward function. Here, we introduce an algorithm called energy rank alignment (ERA) that leverages an explicit reward function to produce a gradient-based objective that we use to optimize autoregressive policies. We show theoretically that this algorithm is closely related to proximal policy optimization (PPO) and direct preference optimization (DPO), but has a minimizer that converges to an ideal Gibbs-Boltzmann distribution with the reward playing the role of an energy function. Furthermore, this algorithm is highly scalable, does not require reinforcement learning, and performs well relative to DPO when the number of preference observations per pairing is small. We deploy this approach to align molecular transformers to generate molecules with externally specified properties and find that it does so robustly, searching through diverse parts of chemical space. While our focus here is on chemical search, we also obtain excellent results on an AI-supervised task for LLM alignment, showing that the method is scalable and general.
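To make the Gibbs-Boltzmann target concrete, the following is a minimal sketch of such a distribution over a small, fixed set of candidates. It is not the ERA objective itself. The function name `gibbs_boltzmann`, the inverse temperature `beta`, the sign convention (higher reward corresponds to lower energy), and the reward values are all assumptions made for illustration.

```python
import math


def gibbs_boltzmann(rewards, beta=1.0):
    """Return probabilities p_i proportional to exp(beta * r_i).

    Assumed convention: the energy of candidate i is U_i = -r_i, so a
    higher reward means a lower energy and a higher probability.
    """
    # Subtract the maximum reward before exponentiating for numerical stability;
    # the shift cancels in the normalization.
    shift = max(rewards)
    weights = [math.exp(beta * (r - shift)) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical rewards for three candidate molecules (illustrative values only).
rewards = [0.0, 1.0, 2.0]

# Small beta: close to uniform, so sampling stays diverse.
low_beta = gibbs_boltzmann(rewards, beta=0.1)

# Large beta: probability mass concentrates on the highest-reward candidate.
high_beta = gibbs_boltzmann(rewards, beta=10.0)
```

In this sketch, `beta` controls the trade-off between concentrating on high-reward candidates and keeping the samples diverse, which is the qualitative behavior one would expect from a target distribution of this form.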