Genomic studies, including CRISPR-based Perturb-seq analyses, face a vast hypothesis space, while gene perturbations remain costly and time-consuming. To facilitate such experiments, gene expression models based on graph neural networks are trained to predict the outcomes of gene perturbations. Active learning methods are often employed to train these models because of the cost of the genomic experiments required to build the training set. However, poor model initialization in active learning can result in suboptimal early selections, wasting time and valuable resources. While typical active learning mitigates this issue over many iterations, the limited number of experimental cycles in genomic studies exacerbates the risk. To address this, we propose graph-based one-shot data selection methods for training gene expression models. Unlike active learning, one-shot data selection predefines the gene perturbations before training, thereby removing the initialization bias. The data selection is motivated by theoretical studies of graph neural network generalization. The criteria are defined over the input graph and are optimized with submodular maximization. We compare them empirically to baselines and to active learning methods that are state-of-the-art on this problem. The results demonstrate that graph-based one-shot data selection achieves comparable accuracy while alleviating the aforementioned risks.
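The abstract does not specify the selection criteria, but the greedy algorithm it alludes to for submodular maximization can be illustrated on a toy facility-location objective over a gene similarity matrix. The function name, the similarity construction, and the objective below are illustrative assumptions, not the paper's actual criteria; the classic greedy routine, however, is the standard (1 - 1/e)-approximation for monotone submodular objectives.

```python
import numpy as np

def greedy_submodular_select(similarity: np.ndarray, budget: int) -> list:
    """Greedily pick `budget` nodes maximizing the facility-location
    objective f(S) = sum_j max_{i in S} similarity[i, j].
    For monotone submodular f, greedy achieves a (1 - 1/e) approximation.
    """
    n = similarity.shape[0]
    selected: list[int] = []
    coverage = np.zeros(n)  # best similarity to the selected set, per node
    for _ in range(budget):
        # Marginal gain of adding each candidate row i:
        # new total coverage if i were added, minus current total coverage.
        gains = np.maximum(similarity, coverage).sum(axis=1) - coverage.sum()
        gains[selected] = -np.inf  # never re-pick a selected node
        best = int(np.argmax(gains))
        selected.append(best)
        coverage = np.maximum(coverage, similarity[best])
    return selected

# Toy usage: a nonnegative similarity matrix from random node features.
rng = np.random.default_rng(0)
X = rng.random((6, 3))
sim = X @ X.T  # hypothetical gene-gene similarity, e.g. from graph embeddings
picks = greedy_submodular_select(sim, budget=2)
```

In the one-shot setting, such a routine would run once over the input graph before any training, so the chosen perturbations do not depend on a (possibly poorly initialized) model.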