Single-cell transcriptomic data approximates the abundance of proteins at high resolution, but its noisiness means it must be transformed by a pipeline of methods before analysis and inference. Because these pipelines and methods lack robust validation, it remains unclear how best to process any particular dataset. To compensate, popular visualisation methods such as t-SNE and UMAP are commonly used to describe datasets. Such visualisations are incomplete and provide subjective descriptions of samples rather than statistically meaningful statements about technical noise or biology. In this paper, we introduce the Zero-Inflated Negative-Binomial with Geometric Tail (ZINBGT), a mixture-model-based strategy for producing interpretable visualisations of each gene's expression across cells, along with diagnostic summaries that use the Wasserstein distance to highlight outlier genes. We use these diagnostics to reveal an outlier gene within a T. brucei sample. We then apply the method to a human immune-cell dataset, highlighting the relationship between sparsity, mean, and spread across genes and revealing an issue with the use of zero-inflated negative-binomial distributions to model single-cell RNA data. An investigation of simulated datasets intended to replicate the immune-cell data reveals discrepancies with the ground truth, establishing purposes for which these simulated datasets are unsuitable. Finally, we list several domains to which this method can be applied.
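To make the idea of a Wasserstein-based diagnostic concrete, the following is a minimal sketch and not the method of this paper. It assumes a plain zero-inflated negative-binomial reference distribution with hand-picked parameters (no geometric tail and no fitting step), compares each gene's empirical count distribution to that reference using the 1-Wasserstein distance, and flags the gene with the largest distance. The function names, gene names, counts, and parameter values are all hypothetical.

```python
import math
from collections import Counter


def zinb_pmf(k, pi, r, p):
    """P(X = k) for a zero-inflated negative binomial.

    pi: extra probability mass at zero, r: NB size (> 0),
    p: NB success probability, strictly between 0 and 1.
    """
    log_nb = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
              + r * math.log(p) + k * math.log1p(-p))
    return pi * (k == 0) + (1.0 - pi) * math.exp(log_nb)


def empirical_pmf(counts, max_count):
    """Relative frequency of each count value 0..max_count."""
    tally = Counter(counts)
    n = len(counts)
    return [tally.get(k, 0) / n for k in range(max_count + 1)]


def wasserstein_1(pmf_a, pmf_b):
    """1-Wasserstein distance between two pmfs on the integers 0..K.

    On a unit-spaced grid this equals the sum of absolute CDF differences.
    Any mass beyond K is ignored, so K should cover both distributions.
    """
    cdf_a = cdf_b = 0.0
    dist = 0.0
    for a, b in zip(pmf_a, pmf_b):
        cdf_a += a
        cdf_b += b
        dist += abs(cdf_a - cdf_b)
    return dist


# Hypothetical per-gene counts across ten cells.
genes = {
    "gene_a": [0, 0, 0, 1, 2, 0, 3, 1, 0, 0],
    "gene_b": [0, 1, 0, 0, 2, 1, 0, 0, 4, 0],
    "gene_c": [9, 12, 0, 15, 11, 10, 0, 14, 13, 12],
}

max_count = 60  # truncation point; the reference has negligible mass beyond it
# Hypothetical reference parameters; a real analysis would estimate them.
model = [zinb_pmf(k, pi=0.3, r=2.0, p=0.6) for k in range(max_count + 1)]

scores = {
    name: wasserstein_1(empirical_pmf(counts, max_count), model)
    for name, counts in genes.items()
}
outlier = max(scores, key=scores.get)
print(scores, outlier)
```

Because the 1-Wasserstein distance is measured in the same units as the counts, a gene whose distribution sits far from the reference stands out by a wide margin, which is what makes it a convenient summary for ranking genes.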