Given two labeled datasets $\mathcal{S}$ and $\mathcal{T}$, we design a simple and efficient greedy algorithm to reweight the loss function so that the limiting distribution of the neural network weights obtained by training on $\mathcal{S}$ approaches the limiting distribution that would have resulted from training on $\mathcal{T}$. On the theoretical side, we prove that when the metric entropy of the input datasets is bounded, our greedy algorithm outputs a near-optimal reweighting, i.e., the two invariant distributions of network weights are provably close in total variation distance. Moreover, the algorithm is simple and scalable, and we prove bounds on its efficiency as well. Our algorithm can deliberately introduce distribution shift to perform (soft) multi-criteria optimization. As a motivating application, we train a neural network to recognize small-molecule binders of MNK2 (a MAP kinase involved in cell signaling) that are non-binders of MNK1 (a highly similar protein). We tune the algorithm's parameter so that the overall change in holdout loss is negligible, while the selectivity, i.e., the fraction of the top 100 predicted MNK2 binders that are MNK1 non-binders, increases from 54\% to 95\% as a result of our reweighting. Of the 43 distinct small molecules predicted to be most selective from the Enamine catalog, 2 were experimentally verified to be selective, i.e., at 10$\mu$M they reduced the enzyme activity of MNK2, but not MNK1, below 50\% -- a 5\% success rate.
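To make the reweighting idea concrete, the sketch below shows one simple greedy scheme in that spirit: per-example weights on $\mathcal{S}$ are built up one increment at a time so that the weighted feature mean of $\mathcal{S}$ moves toward the feature mean of $\mathcal{T}$ (a herding-style heuristic). This is an illustrative sketch only; the function name, the feature-matching objective, and all parameters are assumptions for exposition, not the paper's actual algorithm or its guarantees.

```python
import numpy as np

def greedy_reweight(feats_S, feats_T, n_steps=500):
    """Greedily assign nonnegative per-example weights to the rows of
    feats_S so that the weighted mean of feats_S approaches the mean of
    feats_T. Hypothetical sketch, not the paper's algorithm."""
    target = feats_T.mean(axis=0)
    w = np.zeros(len(feats_S), dtype=float)
    running = np.zeros_like(target)  # sum of features picked so far
    for t in range(n_steps):
        # Current weighted mean (zero vector before the first pick).
        mean = running / t if t > 0 else running
        # Greedy step: pick the example whose feature vector is most
        # aligned with the residual direction toward the target mean.
        scores = feats_S @ (target - mean)
        i = int(np.argmax(scores))
        w[i] += 1.0
        running += feats_S[i]
    return w / w.sum()  # normalized weights for the reweighted loss

# The weights would then rescale a per-example training loss, e.g.:
#   loss = (weights * per_example_loss).sum()
```

Under a mild condition (the target mean lies in the convex hull of the source features), each greedy step shrinks the gap between the weighted source mean and the target mean, which is the same "train on $\mathcal{S}$, behave like $\mathcal{T}$" intuition as the reweighting above, stripped to its simplest form.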