Nearest-neighbor similarity searching is a common task in chemistry, with notable use cases in drug discovery. Yet some of the most commonly used methods for this task still rely on brute-force search. In practice, this can be computationally costly and time-consuming, due in part to the sheer size of modern chemical databases. Previous computational advances for this task have generally relied on improvements to hardware or on dataset-specific tricks that lack generalizability. Approaches that leverage lower-complexity search algorithms remain relatively underexplored. However, many of these algorithms yield only approximate solutions and/or struggle with typical high-dimensional chemical embeddings. Here we evaluate whether a combination of low-dimensional chemical embeddings and a k-d tree data structure can achieve fast nearest-neighbor queries while maintaining performance on standard chemical similarity search benchmarks. We examine different dimensionality reductions of standard chemical embeddings as well as a learned, structurally aware embedding, SmallSA, for this task. With this framework, searches on over one billion chemicals execute in less than a second on a single CPU core, five orders of magnitude faster than the brute-force approach. We also demonstrate that SmallSA achieves competitive performance on chemical similarity benchmarks.
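To make the contrast between brute-force search and a k-d tree query concrete, the following is a minimal sketch in pure Python. It is not the implementation evaluated in this work: the random 8-dimensional vectors stand in for low-dimensional chemical embeddings, and the dimension, dataset size, and all function names (`build`, `nearest`, `brute_force`) are illustrative assumptions.

```python
import random
from typing import List, Optional, Sequence, Tuple


class Node:
    """One k-d tree node: a point index, its split axis, and two subtrees."""
    __slots__ = ("index", "axis", "left", "right")

    def __init__(self, index: int, axis: int,
                 left: "Optional[Node]", right: "Optional[Node]") -> None:
        self.index, self.axis, self.left, self.right = index, axis, left, right


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build(points: List[Sequence[float]],
          indices: Optional[List[int]] = None, depth: int = 0) -> Optional[Node]:
    """Recursively split on the median, cycling through the axes."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % len(points[0])
    indices = sorted(indices, key=lambda i: points[i][axis])
    mid = len(indices) // 2
    return Node(indices[mid], axis,
                build(points, indices[:mid], depth + 1),
                build(points, indices[mid + 1:], depth + 1))


def nearest(node: Optional[Node], points: List[Sequence[float]],
            query: Sequence[float],
            best: Tuple[float, int] = (float("inf"), -1)) -> Tuple[float, int]:
    """Exact nearest neighbor; returns (squared distance, point index)."""
    if node is None:
        return best
    d = sq_dist(points[node.index], query)
    if d < best[0]:
        best = (d, node.index)
    diff = query[node.axis] - points[node.index][node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, points, query, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if diff * diff < best[0]:
        best = nearest(far, points, query, best)
    return best


def brute_force(points: List[Sequence[float]],
                query: Sequence[float]) -> Tuple[float, int]:
    """Baseline: compare the query against every point."""
    return min((sq_dist(p, query), i) for i, p in enumerate(points))


# Hypothetical data: 2000 random 8-dimensional vectors (assumed sizes).
rng = random.Random(0)
points = [[rng.random() for _ in range(8)] for _ in range(2000)]
tree = build(points)
query = [rng.random() for _ in range(8)]
print(nearest(tree, points, query), brute_force(points, query))
```

Both functions return the same exact neighbor; the tree simply skips subtrees whose splitting plane lies farther away than the current best match. That pruning becomes less effective as the dimension grows, which is why the low-dimensional embedding matters for this kind of search.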