The Tanimoto coefficient is commonly used to measure the similarity between molecules represented as discrete fingerprints, either as a distance metric or as a positive definite kernel. While many kernel methods can be accelerated using random feature approximations, such approximations are currently lacking for the Tanimoto kernel. In this paper we propose two kinds of novel random features that allow this kernel to scale to large datasets, and in the process discover a new extension of the kernel to real-valued vectors. We theoretically characterize these random features and provide error bounds on the spectral norm of the Gram matrix. Experimentally, we show that the proposed random features are effective at approximating the Tanimoto coefficient on real-world datasets, and that the kernels explored in this work are useful for molecular property prediction and optimization tasks.
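For readers unfamiliar with the quantity being approximated, the following is a minimal sketch of the standard Tanimoto coefficient on binary fingerprints: the number of bits set in both fingerprints divided by the number of bits set in at least one. It does not implement the random features or the real-vector extension proposed in the paper. The function name `tanimoto`, the toy fingerprints, and the convention of returning 1.0 for two all-zero fingerprints are assumptions made for illustration.

```python
def tanimoto(x, y):
    """Tanimoto coefficient between two equal-length binary fingerprints."""
    if len(x) != len(y):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(x, y) if a and b)    # bits set in both
    either = sum(1 for a, b in zip(x, y) if a or b)   # bits set in at least one
    # Assumed convention: two all-zero fingerprints count as identical.
    return 1.0 if either == 0 else both / either


# Hypothetical 6-bit fingerprints, for illustration only.
fp_a = [1, 1, 0, 1, 0, 0]
fp_b = [1, 0, 0, 1, 1, 0]

similarity = tanimoto(fp_a, fp_b)   # 2 shared bits / 4 bits in the union
distance = 1.0 - similarity         # the corresponding Tanimoto distance
print(similarity, distance)
```

Evaluating this function on every pair of fingerprints in a dataset of n molecules yields the n-by-n Gram matrix, and that quadratic cost is what random feature approximations aim to avoid.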