Given two labeled datasets $\mathcal{S}$ and $\mathcal{T}$, we design a simple and efficient greedy algorithm that reweights the loss function so that the limiting distribution of the neural network weights obtained by training on $\mathcal{S}$ approaches the limiting distribution that would have resulted from training on $\mathcal{T}$. On the theoretical side, we prove that when the metric entropy of the input datasets is bounded, our greedy algorithm outputs a close-to-optimal reweighting, i.e., the two invariant distributions of network weights are provably close in total variation distance. Moreover, the algorithm is simple and scalable, and we prove bounds on its efficiency as well. As a motivating application, we train a neural network to recognize small-molecule binders to MNK2 (a MAP kinase responsible for cell signaling) that are non-binders to MNK1 (a highly similar protein). In our example dataset, of the 43 distinct small molecules predicted to be most selective from the Enamine catalog, 2 were experimentally verified to be selective, i.e., at 10\,$\mu$M they reduced the enzyme activity of MNK2, but not MNK1, below 50\% -- a 5\% success rate.