Gene set analysis, a popular approach for analyzing high-throughput gene expression data, aims to identify sets of related genes that show significantly enriched or depleted expression patterns between different conditions. In recent years, a multitude of methods and corresponding tools have been developed for this task. However, clear guidance is lacking: choosing the right method is the first hurdle a researcher faces. No less challenging than this so-called method uncertainty is preprocessing: researchers must know which steps are required and then select, from the plethora of valid options, an approach that produces the input object the method accepts (data preprocessing uncertainty). Clear guidance is again scarce. Here, we provide a practical guide through all steps required to conduct gene set analysis, beginning with a concise overview of a selection of established methods, including GSEA and DAVID. We place special focus on reviewing and explaining the preprocessing steps necessary for each method under consideration (e.g., the need to transform the RNA-Seq data), an essential aspect that typically receives only limited attention in both existing reviews and applications. To raise awareness of the spectrum of uncertainties, our review is accompanied by an extensive overview of the literature on valid approaches for each step and by illustrative R code demonstrating the complex analysis pipelines. It ends with a discussion and recommendations for both users and developers to ensure that the results of gene set analysis are, despite the above-mentioned uncertainties, replicable and transparent.