We develop a framework for derivative Gaussian process latent variable models (DGP-LVMs) that can handle multi-dimensional output data using modified derivative covariance functions. The modifications account for complexities in the underlying data-generating process, such as scaled derivatives, varying information across multiple output dimensions, and interactions between outputs. Further, our framework provides uncertainty estimates for each latent variable using Bayesian inference. Through extensive simulations, we demonstrate that including derivative information can drastically increase latent variable estimation accuracy, thanks to our proposed covariance function modifications. The developments are motivated by a concrete biological research problem: estimating the unobserved cellular ordering from single-cell RNA (scRNA) sequencing data for gene expression and the corresponding derivative information, known as RNA velocity. Since RNA velocity is only an estimate of the exact derivative information, the derivative covariance functions need to account for potential scale differences. In a real-world case study, we illustrate the application of DGP-LVMs to such scRNA sequencing data. While motivated by this biological problem, our framework is generally applicable to all kinds of latent variable estimation problems involving derivative information, irrespective of the field of study.
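To make the idea of a derivative covariance function with a scale difference concrete, the following is a minimal sketch and not the parameterization used in this work. It assumes a one-dimensional input, a squared-exponential kernel with hypothetical hyperparameters `alpha` (marginal standard deviation) and `rho` (length scale), and a hypothetical factor `deriv_scale` standing in for the idea that the observed derivative (for example, RNA velocity) may equal the true derivative only up to a scale. The function name `se_derivative_cov` is likewise illustrative.

```python
import numpy as np

def se_derivative_cov(x, alpha=1.0, rho=1.0, deriv_scale=1.0):
    """Joint covariance of [f(x), s * f'(x)] for a squared-exponential GP.

    All parameter names are illustrative assumptions; `deriv_scale` (s)
    models a derivative that is observed only up to a scale factor.
    """
    x = np.asarray(x, dtype=float)
    d = x[:, None] - x[None, :]                    # pairwise x_i - x_j
    k = alpha**2 * np.exp(-0.5 * d**2 / rho**2)    # cov(f(x_i), f(x_j))
    # cov(f(x_i), s * f'(x_j)) = s * dk/dx_j
    k_fd = deriv_scale * k * d / rho**2
    # cov(s * f'(x_i), s * f'(x_j)) = s^2 * d^2 k / (dx_i dx_j)
    k_dd = deriv_scale**2 * k * (1.0 - d**2 / rho**2) / rho**2
    # The lower-left block is the transpose of the upper-right block.
    return np.block([[k, k_fd], [k_fd.T, k_dd]])

x = np.array([0.0, 0.5, 1.3])
K = se_derivative_cov(x, alpha=1.2, rho=0.8, deriv_scale=2.0)
```

In a latent variable model, `x` would be the unknown quantity to infer rather than a fixed input, and a matrix of this form would enter the likelihood of the stacked outputs and derivatives for each output dimension.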