We develop a framework for derivative Gaussian process latent variable models (DGP-LVMs) that can handle multi-dimensional output data using modified derivative covariance functions. The modifications account for complexities in the underlying data generating process, such as scaled derivatives, varying information across multiple output dimensions, and interactions between outputs. Further, our framework provides uncertainty estimates for each latent variable using Bayesian inference. Through extensive simulations, we demonstrate that including derivative information by means of our proposed covariance function modifications can drastically increase latent variable estimation accuracy. The developments are motivated by a concrete biological research problem: estimating the unobserved cellular ordering from single-cell RNA (scRNA) sequencing data for gene expression and its corresponding derivative information, known as RNA velocity. Since RNA velocity is only an estimate of the exact derivative information, the derivative covariance functions need to account for potential scale differences. In a real-world case study, we illustrate the application of DGP-LVMs to such scRNA sequencing data. While motivated by this biological problem, our framework is generally applicable to latent variable estimation problems involving derivative information, irrespective of the field of study.
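To make the idea of a derivative covariance function with a scaled derivative concrete, the following is a minimal sketch and not the covariance function proposed in this work. It assumes a squared exponential kernel with marginal standard deviation `alpha` and length scale `rho`, a one-dimensional input, and a single hypothetical multiplier `deriv_scale` on the derivative observations; all of these names are illustrative assumptions.

```python
import numpy as np

def se_derivative_cov(x, alpha=1.0, rho=1.0, deriv_scale=1.0):
    """Joint covariance of [f(x), s * f'(x)] under a squared exponential kernel.

    Assumption: `deriv_scale` (s) is a hypothetical constant that rescales the
    derivative observations; it is not taken from the source.
    """
    x = np.asarray(x, dtype=float)
    d = x[:, None] - x[None, :]                       # pairwise differences x_i - x_j
    k = alpha**2 * np.exp(-0.5 * d**2 / rho**2)       # cov(f(x_i), f(x_j))
    k_fd = deriv_scale * k * d / rho**2               # cov(f(x_i), s f'(x_j))
    k_dd = deriv_scale**2 * k * (1.0 / rho**2 - d**2 / rho**4)  # cov(s f'(x_i), s f'(x_j))
    # Block matrix over the stacked vector [f(x_1..n), s f'(x_1..n)]
    return np.block([[k, k_fd], [k_fd.T, k_dd]])

# Usage: three latent inputs give a 6 x 6 joint covariance matrix
K = se_derivative_cov([0.0, 0.5, 1.3], alpha=1.2, rho=0.7, deriv_scale=2.0)
print(K.shape)
```

Because the derivative blocks are obtained by differentiating the kernel and multiplying by the same constant, the joint matrix stays symmetric and positive semi-definite for any value of `deriv_scale`.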