Looking for (Genomic) Needles in a Haystack: Sparsity-Driven Search for Identifying Correlated Genetic Mutations in Cancer

Cancer typically arises not from a single genetic mutation (i.e., a hit) but from multi-hit combinations that accumulate within cells. However, enumerating multi-hit combinations becomes exponentially more expensive as the number of candidate gene combinations grows, which is on the order of 20,000 choose h, where 20,000 is the approximate number of genes in the human genome and h is the number of hits. To address this challenge, we present an algorithmic framework called Pruned Depth-First Search (P-DFS), which leverages the high sparsity of tumor mutation data to prune large portions of the search space. P-DFS, the main contribution of this paper, is a pruning technique grounded in depth-first search with backtracking. It exploits sparsity to drastically reduce the otherwise exponential h-hit search space of candidate combinations used by Weighted Set Cover: infeasible gene subsets are pruned early, while the weighted set cover formulation systematically scores and selects the most discriminative combinations. By combining these ideas with optimized bitwise operations and a scalable distributed algorithm on high-performance computing clusters, our algorithm achieves an approximately 90-98% reduction in visited combinations for 4-hit combinations and a roughly 183x speedup over the exhaustive set cover approach (which is NP-complete), measured on 147,456 ranks. As a result, our method can feasibly handle four-hit and even higher-order gene combinations with both speed and resource efficiency.
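To make the pruning idea concrete, the following is a minimal single-process sketch, not the implementation evaluated in this paper. It assumes each gene is stored as an integer bitset over tumor samples and that a partial combination is infeasible once no tumor sample carries all of its genes; this pruning rule, the function names `pruned_dfs` and `brute_force`, and the toy data are illustrative assumptions. The weighted set cover scoring and the distributed execution are omitted.

```python
from itertools import combinations
from math import comb


def pruned_dfs(tumor_masks, h):
    """Enumerate h-gene combinations that co-occur in at least one tumor sample.

    tumor_masks[g] is an int used as a bitset: bit s is set when gene g is
    mutated in tumor sample s (an assumed encoding for this sketch).
    Returns (feasible_combinations, number_of_candidate_extensions_tested).
    """
    n = len(tumor_masks)
    feasible = []
    visited = 0

    # Start with every sample that has at least one mutated gene.
    all_samples = 0
    for m in tumor_masks:
        all_samples |= m

    def dfs(start, chosen, mask):
        nonlocal visited
        if len(chosen) == h:
            feasible.append(tuple(chosen))
            return
        # Stop early enough to leave room for the remaining genes.
        for g in range(start, n - (h - len(chosen)) + 1):
            visited += 1
            new_mask = mask & tumor_masks[g]  # samples carrying all chosen genes
            if new_mask == 0:
                continue  # assumed pruning rule: no sample supports this subset
            chosen.append(g)
            dfs(g + 1, chosen, new_mask)
            chosen.pop()  # backtrack

    dfs(0, [], all_samples)
    return feasible, visited


def brute_force(tumor_masks, h):
    """Exhaustive baseline: test every h-gene combination (h >= 1)."""
    hits = []
    for combo in combinations(range(len(tumor_masks)), h):
        mask = tumor_masks[combo[0]]
        for g in combo[1:]:
            mask &= tumor_masks[g]
        if mask:
            hits.append(combo)
    return hits


if __name__ == "__main__":
    # Toy data: 5 genes, 4 tumor samples (hypothetical values).
    masks = [0b0011, 0b0001, 0b0100, 0b0001, 0b1000]
    combos, visited = pruned_dfs(masks, 3)
    print("feasible 3-hit combinations:", combos)
    print("extensions tested:", visited, "of", comb(len(masks), 3), "full combinations")
    print("size of the unpruned 4-hit space over 20,000 genes:", comb(20000, 4))
```

Because the running intersection only shrinks as genes are added, an empty intersection at any depth rules out every extension of that subset. On sparse mutation data this is where a pruned search can skip large parts of the 20,000 choose h space.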
