Contrastive dimension reduction methods have been developed for case-control study data to identify variation that is enriched in the foreground (case) data X relative to the background (control) data Y. Here, we develop contrastive regression for the setting in which a response variable r is associated with each foreground observation. This setting arises frequently when, for example, the affected cases have a disease grade or intervention dosage but the unaffected controls do not, as with autism severity, solid tumor stages, polyp sizes, or warfarin dosages. Our contrastive regression model captures shared low-dimensional variation between the predictors in the case and control groups, and then explains the case-specific response variables through the variance that remains in the predictors after the shared variation is removed. We evaluate the model on two datasets: a single-nucleus RNA sequencing dataset on autism severity in postmortem brain samples from donors with and without autism, and a single-cell RNA sequencing dataset on cellular differentiation in chronic rhinosinusitis with and without nasal polyps. In both, our contrastive linear regression performs feature ranking and identifies biologically informative predictors associated with the response that cannot be identified using other approaches.
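
The two-step idea described above (remove variation shared with the controls, then regress the response on what remains) can be illustrated with a rough sketch. This is not the model developed here: as a simplifying assumption, it uses the top-k principal directions of the background data Y as a stand-in for the shared low-dimensional variation, and plain ridge regression for the second step. The function name `contrastive_regression_sketch` and the parameters `k` and `ridge` are hypothetical and chosen only for illustration.

```python
import numpy as np


def contrastive_regression_sketch(X, Y, r, k, ridge=1e-3):
    """Toy two-step illustration (an assumption, not the paper's model).

    X: (n, p) foreground (case) predictors
    Y: (m, p) background (control) predictors
    r: (n,)   response observed only for foreground samples
    k: number of directions treated as shared variation (hypothetical knob)
    """
    # Center each group separately so group means do not drive the directions.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)

    # Stand-in for shared variation: top-k principal directions of the background.
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    S = Vt[:k].T  # shape (p, k), orthonormal columns

    # Project the shared subspace out of the foreground predictors.
    X_res = Xc - Xc @ S @ S.T

    # Ridge regression of the centered response on the residual predictors.
    # The small ridge term keeps the system solvable after the projection,
    # which makes X_res rank-deficient.
    p = X.shape[1]
    rc = r - r.mean()
    beta = np.linalg.solve(X_res.T @ X_res + ridge * np.eye(p), X_res.T @ rc)

    # Rank features by absolute coefficient size, largest first.
    ranking = np.argsort(-np.abs(beta))
    return beta, ranking
```

In this sketch, a feature whose variance is equally large in cases and controls receives a coefficient near zero, while a feature that varies only in the cases and tracks the response moves to the top of the ranking. A joint model of the kind described above would estimate the shared and case-specific components together rather than in two separate steps.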