Improving Small Molecule Generation using Mutual Information Machine

We address the task of controlled generation of small molecules, which entails finding novel molecules with desired properties under certain constraints (e.g., similarity to a reference molecule). We introduce MolMIM, a probabilistic auto-encoder for small molecule drug discovery that learns an informative and clustered latent space. MolMIM is trained with Mutual Information Machine (MIM) learning and provides a fixed-length representation of variable-length SMILES strings. Since encoder-decoder models can learn representations with ``holes'' of invalid samples, we propose a novel extension to the training procedure that promotes a dense latent space and allows the model to sample valid molecules from random perturbations of latent codes. We provide a thorough comparison of MolMIM to several variable-size and fixed-size encoder-decoder models, demonstrating MolMIM's superior generation as measured by validity, uniqueness, and novelty. We then apply CMA-ES, a naive black-box, gradient-free search algorithm, over MolMIM's latent space for the task of property-guided molecule optimization. We achieve state-of-the-art results in several constrained single-property optimization tasks as well as in the challenging task of multi-objective optimization, improving on the previous state-of-the-art success rate by more than 5\%. Because CMA-ES is often used only as a baseline optimization method, we attribute these strong results to MolMIM's latent representation, which clusters similar molecules in the latent space. We also demonstrate that MolMIM is favourable in a compute-limited regime, making it an attractive model for such cases.
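
The three generation metrics named above (validity, uniqueness, and novelty) can be illustrated with a minimal sketch. The definitions used here are a common convention and are an assumption, not necessarily the exact ones used for MolMIM. The function `generation_metrics` and the validator `toy_is_valid` are hypothetical names introduced only for this illustration. The toy validator merely checks for balanced parentheses, whereas a real pipeline would parse and canonicalize SMILES strings with a cheminformatics toolkit.

```python
from typing import Callable, Dict, Iterable


def generation_metrics(
    generated: Iterable[str],
    training_set: Iterable[str],
    is_valid: Callable[[str], bool],
) -> Dict[str, float]:
    """Compute validity, uniqueness, and novelty for generated strings.

    Assumed definitions (conventions differ between papers):
      validity   = valid samples / all generated samples
      uniqueness = distinct valid samples / valid samples
      novelty    = distinct valid samples not in training / distinct valid samples
    """
    generated = list(generated)
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)

    def ratio(num: int, den: int) -> float:
        # Return 0.0 instead of raising when the denominator is empty.
        return num / den if den else 0.0

    return {
        "validity": ratio(len(valid), len(generated)),
        "uniqueness": ratio(len(unique), len(valid)),
        "novelty": ratio(len(novel), len(unique)),
    }


def toy_is_valid(smiles: str) -> bool:
    """Hypothetical stand-in validator: non-empty with balanced parentheses.

    This is NOT a chemical validity check; it only keeps the sketch
    self-contained.
    """
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return bool(smiles) and depth == 0


# Toy data: one duplicate, one malformed string, one training-set member.
samples = ["CCO", "CCO", "CC(C)O", "CC(C", "c1ccccc1"]
train = ["CCO"]
metrics = generation_metrics(samples, train, toy_is_valid)
print(metrics)
```

In this toy run, four of the five strings pass the validator, three of those four are distinct, and two of the three distinct strings are absent from the training list. String comparison here is literal; comparing canonical forms would be needed to treat different SMILES spellings of the same molecule as duplicates.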
