Large Language Models (LLMs) have the potential to accelerate small molecule drug design due to their ability to reason about information from diverse sources and formats. However, their practical utility remains unclear due to the lack of benchmarks that reflect real-world scenarios. In this work, we introduce a suite of chemically grounded tasks spanning molecular property prediction, molecular representation transformations, and molecular design. Importantly, we formulate these tasks as reinforcement learning (RL) environments, enabling a unified approach for evaluation and post-training. Across three model families, we find that frontier models are increasingly proficient at chemical tasks, but that there is significant room for improvement, especially in low-data experimental settings. Critically, we show that RL-based post-training can substantially improve performance. A smaller model post-trained on our environments becomes competitive with state-of-the-art frontier models, despite starting from a significantly weaker base model. This suggests a practical route toward employing LLMs in drug discovery: by combining carefully designed evaluation tasks with targeted post-training, we can both elucidate and close critical capability gaps.