Could dropping a few cells change the takeaways from differential expression?

Differential expression (DE) plays a fundamental role in illuminating the molecular mechanisms driving a difference between groups (e.g., due to treatment or disease). While any analysis is run on particular cells/samples, the intent is to generalize to future occurrences of the treatment or disease. Implicitly, this step is justified by assuming that present and future samples are independent and identically distributed draws from the same population. Though this assumption is always false, we hope that any deviation from it is small enough that (A) the conclusions of the analysis still hold and (B) standard tools like standard error, significance, and power still reflect generalizability. Conversely, we might worry about these deviations, and about reliance on standard tools, if conclusions could be substantively changed by dropping a very small fraction of the data. While checking every small subset is computationally intractable, recent work develops an approximation to identify when such an influential subset exists. Building on this work, we develop a metric for the dropping-data robustness of DE; namely, we cast the analysis in a form suitable to the approximation, extend the approximation to models with data-dependent hyperparameters, and extend the notion of a data point from a single cell to a pseudobulk observation. We then overcome the inherent non-differentiability of gene set enrichment analysis to develop an additional approximation for the robustness of top gene sets. We assess the robustness of DE for published single-cell RNA-seq data and discover that thousands of genes can have their results flipped by dropping <1% of the data, including hundreds that are sensitive to dropping a single cell (0.07%). Surprisingly, this non-robustness extends to high-level takeaways; half of the top 10 gene sets can be changed by dropping 1-2% of cells, and 2 of the 10 can be changed by dropping a single cell.
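
To make the idea of a dropping-data approximation concrete, the following is a minimal sketch and not the method developed in this work. It assumes a deliberately simple stand-in for a DE model: ordinary least squares of one gene's (simulated) expression on a group indicator. Each observation's first-order influence on the group coefficient is computed once; summing the k largest influences then approximates how far the coefficient could be pushed down by dropping k observations, without refitting on every subset. The function names `approx_most_influential_drop` and `refit_without`, the simulated data, and the choice of OLS are all assumptions made for illustration.

```python
import numpy as np


def approx_most_influential_drop(X, y, coef_index, k):
    """First-order approximation of the largest decrease in one OLS
    coefficient achievable by dropping k rows (hypothetical helper)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ y)
    resid = y - X @ beta
    # Derivative of the coefficient with respect to the weight of row n,
    # evaluated at all weights equal to 1: [(X^T X)^{-1} x_n]_j * r_n.
    influence = (X @ XtX_inv[:, coef_index]) * resid
    # Dropping row n moves its weight from 1 to 0, so the coefficient
    # changes by roughly -influence[n]. To decrease it as much as
    # possible, drop the k rows with the largest influence.
    drop = np.argsort(-influence)[:k]
    predicted = beta[coef_index] - influence[drop].sum()
    return drop, beta[coef_index], predicted


def refit_without(X, y, drop, coef_index):
    """Exact refit after removing the rows in `drop`, for comparison."""
    keep = np.setdiff1d(np.arange(len(y)), drop)
    beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    return beta[coef_index]


# Simulated example (assumed data, not from any real experiment):
# 1000 "cells", a binary group label, and a small true group effect.
rng = np.random.default_rng(0)
n_cells = 1000
group = rng.integers(0, 2, n_cells).astype(float)
X = np.column_stack([np.ones(n_cells), group])  # intercept + group
y = 0.1 * group + rng.normal(size=n_cells)

drop, full_fit, predicted = approx_most_influential_drop(X, y, coef_index=1, k=10)
exact = refit_without(X, y, drop, coef_index=1)
print(f"group effect, all cells:          {full_fit:.4f}")
print(f"predicted after dropping 10 (1%): {predicted:.4f}")
print(f"exact refit without those cells:  {exact:.4f}")
```

Comparing the predicted value with the exact refit shows how well the linear approximation tracks the true change for the selected subset. If the predicted coefficient crosses zero, or a significance threshold, for a small k, that analysis would be flagged as non-robust to dropping data in the sense described above.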
