High-throughput drug screening, which uses cell imaging or gene expression measurements as readouts of drug effect, is a critical tool in biotechnology for assessing and understanding the relationship between the chemical structure and biological activity of a drug. Because large-scale screens have to be divided into multiple experiments, a key difficulty is dealing with batch effects, which can introduce systematic errors and non-biological associations in the data. We propose InfoCORE, an Information maximization approach for COnfounder REmoval, to effectively deal with batch effects and obtain refined molecular representations. InfoCORE establishes a variational lower bound on the conditional mutual information of the latent representations given a batch identifier, and it adaptively reweights samples to equalize their implied batch distribution. Extensive experiments on drug screening data show InfoCORE's superior performance on multiple tasks, including molecular property prediction and molecule-phenotype retrieval. Additionally, we show that InfoCORE offers a versatile framework that addresses general distribution shifts and data fairness issues by minimizing correlation with spurious features or removing sensitive attributes. The code is available at https://github.com/uhlerlab/InfoCORE.
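To make the reweighting idea in the abstract more concrete, the following is a minimal sketch of a weighted InfoNCE-style contrastive loss in which each negative sample is weighted by an estimated probability that it comes from the same batch as the anchor. It is an illustration under stated assumptions, not the authors' implementation: the function name `reweighted_info_nce`, the `batch_probs` input (assumed to come from some separately trained batch classifier), the cosine-similarity critic, and the temperature value are all hypothetical choices made here. The released code at the repository above should be consulted for the actual objective.

```python
import numpy as np


def reweighted_info_nce(z_a, z_b, batch_ids, batch_probs, temperature=0.1):
    """Weighted InfoNCE-style loss over N paired embeddings (illustrative only).

    z_a, z_b    : (N, D) paired embeddings, e.g. structure and screen readout.
    batch_ids   : (N,) integer batch label of each pair.
    batch_probs : (N, B) estimated posterior over batches for each sample
                  (assumed to be produced by a hypothetical batch classifier).
    """
    # Cosine-similarity critic scaled by a temperature.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # logits[i, j] scores anchor i against sample j

    n = logits.shape[0]
    # w[i, j] = estimated probability that sample j belongs to anchor i's batch.
    w = batch_probs[:, batch_ids].T.copy()
    np.fill_diagonal(w, 0.0)
    # Rescale so the negative weights of each anchor sum to N - 1;
    # uniform batch_probs then recover the unweighted InfoNCE loss.
    w = w * (n - 1) / np.maximum(w.sum(axis=1, keepdims=True), 1e-12)
    np.fill_diagonal(w, 1.0)  # the positive pair always has weight 1

    # Numerically stable log of the weighted partition term.
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log((w * np.exp(logits - row_max)).sum(axis=1))
    return float(np.mean(log_denom - np.diag(logits)))
```

With uniform `batch_probs` every negative receives weight 1 and the function reduces to the standard InfoNCE loss. With sharper posteriors, negatives that look like they come from the anchor's own batch dominate the denominator, so the contrast is effectively made within a batch rather than across batches.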