Gene sets are grouped into collections according to their biological roles. These collections are often high-dimensional, overlapping, and redundant families of sets, which precludes a straightforward interpretation and study of their content. Bioinformatics has sought ways to reduce their dimension or increase their interpretability. One possibility is to aggregate overlapping gene sets into larger pathways, but the modified pathways are hard to justify biologically. We propose to rank the pathways in a collection using importance scores, studying the problem from a set covering perspective. The proposed Shapley value-based scores take into account the distribution of the singletons and the size of the sets in the families; furthermore, a trick allows us to circumvent the usual exponential complexity of computing Shapley values. Finally, we address the challenge of making the obtained rankings redundancy-aware, where, in our case, sets are redundant if they show prominent intersections. The rankings can be used to reduce the dimension of collections of gene sets so that they show lower redundancy while retaining high coverage of the genes. We further investigate the impact of our selection on Gene Set Enrichment Analysis. The proposed method has practical utility in bioinformatics for increasing the interpretability of collections of gene sets, and it is a step toward incorporating redundancy into Shapley value computations.
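To illustrate how exponential cost can be avoided in a set covering setting, the sketch below assumes that the worth of a coalition of gene sets is simply the number of distinct genes it covers. The abstract does not specify the game or the trick, so this is an illustrative assumption and not necessarily the authors' scoring function. Under that assumption, each gene contributes one unit that is split evenly among the sets containing it, so the Shapley value of a set is the sum of 1/c(g) over its genes g, where c(g) is the number of sets in the family containing g. The function names and the toy family are hypothetical; the brute-force version is included only to check the closed form on a small input.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations


def coverage_shapley(family):
    """Closed-form Shapley values for the assumed coverage game
    v(S) = number of distinct genes covered by the sets in S.

    family: dict mapping a set name to a Python set of genes.
    Runs in time linear in the total size of the family.
    """
    # c(g): how many sets of the family contain gene g
    counts = Counter(g for genes in family.values() for g in genes)
    return {
        name: sum((Fraction(1, counts[g]) for g in genes), Fraction(0))
        for name, genes in family.items()
    }


def brute_force_shapley(family):
    """Reference computation: average marginal coverage over all orderings.
    Factorial cost, usable only for tiny families."""
    names = list(family)
    totals = {name: Fraction(0) for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        covered = set()
        for name in order:
            # marginal contribution: genes not yet covered by predecessors
            totals[name] += len(family[name] - covered)
            covered |= family[name]
    return {name: totals[name] / len(orderings) for name in names}


# Hypothetical toy collection of three overlapping "pathways"
toy_family = {
    "P1": {"a", "b", "c"},
    "P2": {"b", "c"},
    "P3": {"c", "d"},
}

scores = coverage_shapley(toy_family)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy family the scores sum to the number of covered genes (the efficiency property), and sets that share many genes with others receive lower scores than their size alone would suggest. This sketch does not implement the redundancy-aware ranking mentioned in the abstract.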